UltraSPARC™-IIi
User’s Manual
Sun Microelectronics
901 San Antonio Road
Palo Alto, CA 94303 USA
800 681-8845
http://www.sun.com/microelectronics
Part No.: 805-0087-01

Copyright © 1997 Sun Microsystems, Inc. All rights reserved.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY EXPRESS REPRESENTATIONS OR WARRANTIES. IN ADDITION, SUN MICROSYSTEMS, INC. DISCLAIMS ALL IMPLIED REPRESENTATIONS AND WARRANTIES, INCLUDING ANY WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT OF THIRD PARTY INTELLECTUAL PROPERTY RIGHTS.

This document contains proprietary information of Sun Microsystems, Inc. or under license from third parties. No part of this document may be reproduced in any form or by any means or transferred to any third party without the prior written consent of Sun Microsystems, Inc.

Sun, Sun Microsystems and the Sun Logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The information contained in this document is not designed or intended for use in on-line control of aircraft, air traffic, aircraft navigation or aircraft communications; or in the design, construction, operation or maintenance of any nuclear facility. Sun disclaims any express or implied warranty of fitness for such uses.
Contents
Preface
Overview
A Brief History of SPARC and PCI
How to Use This Book
Textual Conventions
Contents

1. UltraSPARC-IIi Basics
1.1 Overview
1.2 Design Philosophy
1.3 Component Description
1.3.1 PCI Bus Module (PBM)
1.3.2 IO Memory Management Unit (IOM)
1.3.3 External Cache Control Unit (ECU)
1.3.4 Memory Controller Unit (MCU)
1.3.5 Instruction Cache (I-cache)
1.3.6 Data Cache (D-cache)
1.3.7 Prefetch and Dispatch Unit (PDU)
1.3.8 Translation Lookaside Buffers (iTLB and dTLB)
1.3.9 Integer Execution Unit (IEU)
1.3.10 Floating-Point Unit (FPU)
1.3.11 Graphics Unit (GRU)
1.3.12 Load/Store Unit (LSU)
1.3.13 Phase Locked Loops (PLL)
1.3.14 Signals

2. Processor Pipeline
2.1 Introductions
2.2 Pipeline Stages
2.2.1 Stage 1: Fetch (F) Stage
2.2.2 Stage 2: Decode (D) Stage
2.2.3 Stage 3: Grouping (G) Stage
2.2.4 Stage 4: Execution (E) Stage
2.2.5 Stage 5: Cache Access (C) Stage
2.2.6 Stage 6: N1 Stage
2.2.7 Stage 7: N2 Stage
2.2.8 Stage 8: N3 Stage
2.2.9 Stage 9: Write (W) Stage

3. Cache Organization
3.1 Introduction
3.1.1 Level-1 Caches
3.1.2 Level-2 PIPT External Cache (E-cache)

4. Overview of I and D-MMUs
4.1 Introduction
4.2 Virtual Address Translation

5. UltraSPARC-IIi in a System
5.1 A Hardware Reference Platform
5.2 Memory Subsystem
UltraSPARC-IIi User’s Manual • October 1997
5.2.1 E-cache
5.2.2 DRAM Memory
5.2.3 Transceivers
5.3 PCI Interface—Advanced PCI Bridge
5.4 RIC Chip
5.5 UPA64S interface (FFB)
5.6 Alternate RMTV support
5.7 Power Management

6. Address Spaces, ASIs, ASRs, and Traps
6.1 Overview
6.2 Physical Address Space
6.2.1 Port Allocations
6.2.2 Memory DIMM requirements
6.2.3 PCI Address Assignments
6.2.4 Probing the address space
6.3 Alternate Address Spaces
6.3.1 Supported SPARC-V9 ASIs
6.3.2 UltraSPARC-IIi (Non-SPARC-V9) ASI Extensions
6.4 Summary of CSRs mapped to the Noncacheable address space
6.5 Ancillary State Registers
6.5.1 Overview of ASRs
6.5.2 SPARC-V9-Defined ASRs
6.5.3 Non-SPARC-V9 ASRs
6.6 Other UltraSPARC-IIi Registers
6.7 Supported Traps

7. UltraSPARC-IIi Memory System
7.1 Overview
7.2 10-bit Column Addressing
7.3 11-bit Column Addressing

8. Cache and Memory Interactions
8.1 Introduction
8.2 Cache Flushing
8.2.1 Address Aliasing Flushing
8.2.2 Committing Block Store Flushing
8.2.3 Displacement Flushing
8.3 Memory Accesses and Cacheability
8.3.1 Coherence Domains
8.3.2 Memory Synchronization: MEMBAR and FLUSH
8.3.3 Atomic Operations
8.3.4 Non-Faulting Load
8.3.5 PREFETCH Instructions
8.3.6 Block Loads and Stores
8.3.7 I/O (PCI or UPA64S) and Accesses with Side-effects
8.3.8 Instruction Prefetch to Side-Effect Locations
8.3.9 Instruction Prefetch When Exiting RED_state
8.3.10 UltraSPARC-IIi Internal ASIs
8.4 Load Buffer
8.5 Store Buffer
8.5.1 Stores Delayed by Loads
8.5.2 Store Buffer Compression

9. PCI Bus Interface
9.1 Introduction
9.1.1 Supported PCI features
9.1.2 Unsupported PCI features
9.2 PCI Bus Operations
9.2.1 Basic Read/Write Cycles
9.2.2 Transaction Termination Behavior
9.2.3 Addressing Modes
9.2.4 Configuration Cycles
9.2.5 Special Cycles
9.2.6 PCI INT_ACK Generation
9.2.7 Exclusive Access
9.2.8 Fast Back-to-Back Cycles
9.3 Functional Topics
9.3.1 PCI Arbiter
9.3.2 PCI Commands
9.4 Little-endian Support
9.4.1 Endian-ness
9.4.2 Big- and Little-endian regions
9.4.3 Specific Cases

10. UltraSPARC-IIi IOM
10.1 Block Diagram
10.2 TLB Entry Formats
10.2.1 TLB CAM Tag
10.2.2 TLB RAM Data
10.3 DMA Operational Modes
10.3.1 Translation Mode
10.3.2 Bypass Mode
10.3.3 Pass-through Mode
10.4 Translation Storage Buffer
10.4.1 Translation Table Entry
10.4.2 TSB Lookup
10.5 PIO Operations
10.6 Translation Errors
10.7 IOM Demap
10.8 Pseudo-LRU replacement algorithm
10.9 TLB Initialization and Diagnostics

11. Interrupt Handling
11.1 Overview
11.1.1 Mondo Dispatch Overview
11.2 Mondo Unit Functional Description
11.2.1 Mondo Vectors
11.3 Details
11.4 Interrupt Initialization
11.5 Interrupt Servicing
11.6 Interrupt Sources
11.6.1 PCI Interrupts
11.6.2 On-board Device Interrupts
11.6.3 Graphic Interrupt
11.6.4 Error Interrupts
11.6.5 Software Interrupts
11.7 Interrupt Concentrator
11.8 UltraSPARC-IIi Interrupt Handling
11.8.1 Interrupt States
11.8.2 Interrupt Prioritizing
11.8.3 Interrupt Dispatching
11.9 Interrupt Global Registers
11.10 Interrupt ASI Registers
11.10.1 Outgoing Interrupt Vector Data
11.10.2 Interrupt Vector Dispatch
11.10.3 Interrupt Vector Dispatch Status Register
11.10.4 Incoming Interrupt Vector Data
11.10.5 Interrupt Vector Receive
11.11 Software Interrupt (SOFTINT) Register

12. Instruction Set Summary

13. VIS™ and Additional Instructions
13.1 Introduction
13.2 Graphics Data Formats
13.2.1 8-Bit Format
13.2.2 Fixed Data Formats
13.3 Graphics Status Register (GSR)
13.4 Graphics Instructions
13.4.1 Opcode Format
13.4.2 Partitioned Add/Subtract Instructions
13.4.3 Pixel Formatting Instructions
13.4.4 Partitioned Multiply Instructions
13.4.5 Alignment Instructions
13.4.6 Logical Operate Instructions
13.4.7 Pixel Compare Instructions
13.4.8 Edge Handling Instructions
13.4.9 Pixel Component Distance (PDIST)
13.4.10 Three-Dimensional Array Addressing Instructions
13.5 Memory Access Instructions
13.5.1 Partial Store Instructions
13.5.2 Short Floating-Point Load and Store Instructions
13.5.3 Block Load and Store Instructions
13.6 Additional Instructions
13.6.1 Atomic Quad Load
13.6.2 SHUTDOWN

14. Implementation Dependencies
14.1 SPARC-V9 General Information
14.1.1 Level-2 Compliance (Impdep #1)
14.1.2 Unimplemented Opcodes, ASIs, and ILLTRAP
14.1.3 Trap Levels (Impdep #37, 38, 39, 40, 114, 115)
14.1.4 Alternate RSTV support
14.1.5 Trap Handling (Impdep #16, 32, 33, 35, 36, 44)
14.1.6 SIGM Support (Impdep #116)
14.1.7 44-bit Virtual Address Space
14.1.8 TICK Register
14.1.9 Population Count Instruction (POPC)
14.1.10 Secure Software
14.1.11 Address Masking (Impdep #125)
14.2 SPARC-V9 Integer Operations
14.2.1 Integer Register File and Window Control Registers (Impdep #2)
14.2.2 Clean Window Handling (Impdep #102)
14.2.3 Integer Multiply and Divide
14.2.4 Version Register (Impdep #2, 13, 101, 104)
14.3 SPARC-V9 Floating-Point Operations
14.3.1 Subnormal Operands & Results; Non-standard Operation
14.3.2 Overflow, Underflow, and Inexact Traps (Impdep #3, 55)
14.3.3 Quad-Precision Floating-Point Operations (Impdep #3)
14.3.4 Floating Point Upper and Lower Dirty Bits in FPRS Register
14.3.5 Floating-Point Status Register (FSR) (Impdep #13, 19, 22, 23, 24)
14.4 SPARC-V9 Memory-Related Operations
14.4.1 Load/Store Alternate Address Space (Impdep #5, 29, 30)
14.4.2 Load/Store ASR (Impdep #6, 7, 8, 9, 47, 48)
14.4.3 MMU Implementation (Impdep #41)
14.4.4 FLUSH and Self-Modifying Code (Impdep #122)
14.4.5 PREFETCH{A} (Impdep #103, 117)
14.4.6 Non-faulting Load and MMU Disable (Impdep #117)
14.4.7 LDD/STD Handling (Impdep #107, 108)
14.4.8 FP mem_address_not_aligned (Impdep #109, 110, 111, 112)
14.4.9 Supported Memory Models (Impdep #113, 121)
14.4.10 I/O Operations (Impdep #118, 123)
14.5 Non-SPARC-V9 Extensions
14.5.1 Per-Processor TICK Compare Field of TICK Register
14.5.2 Cache Sub-system
14.5.3 Memory Management Unit
14.5.4 Error Handling
14.5.5 Block Memory Operations
14.5.6 Partial Stores
14.5.7 Short Floating-Point Loads and Stores
14.5.8 Atomic Quad-load
14.5.9 PSTATE Extensions: Trap Globals
14.5.10 Interrupt Vector Handling
14.5.11 Power Down Support and the SHUTDOWN Instruction
14.5.12 UltraSPARC-IIi Instruction Set Extensions (Impdep #106)
14.5.13 Performance Instrumentation
14.5.14 Debug and Diagnostics Support

15. MMU Internal Architecture
15.1 Introduction
15.2 Translation Table Entry (TTE)
15.3 Translation Storage Buffer (TSB)
15.3.1 Hardware Support for TSB Access
15.3.2 Alternate Global Selection During TLB Misses
15.4 MMU-Related Faults and Traps
15.4.1 Instruction_access_MMU_miss Trap
15.4.2 Instruction_access_exception Trap
15.4.3 Data_access_MMU_miss Trap
15.4.4 Data_access_exception Trap
15.4.5 Data_access_protection Trap
15.4.6 Privileged_action Trap
15.4.7 Watchpoint Trap
15.4.8 Mem_address_not_aligned Trap
15.5 MMU Operation Summary
15.6 ASI Value, Context, and Endianness Selection for Translation
15.7 MMU Behavior During Reset, MMU Disable, and RED_state
15.8 Compliance with the SPARC-V9 Annex F
15.9 MMU Internal Registers and ASI Operations
15.9.1 Accessing MMU Registers
15.9.2 I-/D-TSB Tag Target Registers
15.9.3 Context Registers
15.9.4 I-/D-MMU Synchronous Fault Status Registers (SFSR)
15.9.5 I-/D-MMU Synchronous Fault Address Registers (SFAR)
15.9.6 I-/D-Translation Storage Buffer (TSB) Registers
15.9.7 I-/D-TLB Tag Access Registers
15.9.8 I-/D-TSB 8 kB/64 kB Pointer and Direct Pointer Registers
15.9.9 I-/D-TLB Data-In/Data-Access/Tag-Read Registers
15.9.10 I-/D-MMU Demap
15.9.11 I-/D-Demap Page (Type=0)
15.9.12 I-/D-Demap Context (Type=1)
15.10 MMU Bypass Mode
15.11 TLB Hardware
15.11.1 TLB Operations
15.11.2 TLB Replacement Policy
15.11.3 TSB Pointer Logic Hardware Description

16. Error Handling
16.1 System Fatal Errors
16.2 Deferred Errors
16.2.1 Probing PCI during boot using deferred errors
16.2.2 General software for handling deferred errors
16.3 Disrupting Errors
16.4 E-cache, Memory, and Bus Errors
16.4.1 E-cache Tag Parity Error
16.4.2 E-cache Data Parity Error
16.4.3 DRAM ECC Error
16.4.4 CE/UE
16.4.5 Timeout
16.4.6 PCI Timeout
16.4.7 PCI Data Parity Error
16.4.8 PCI Target-Abort
16.4.9 DMA ECC Errors
16.4.10 IOMMU Translation Error
16.4.11 PCI Address Parity Error
16.4.12 PCI System Error
16.5 Summary of Error Reporting
16.6 E-cache Unit (ECU) Error Registers
16.6.1 E-cache Error Enable Register
16.6.2 ECU Asynchronous Fault Status Register
16.6.3 ECU Asynchronous Fault Address Register
16.6.4 SDBH Error Register
16.6.5 SDBL Error Register
16.6.6 SDBH Control Register
16.6.7 SDBL Control Register
16.6.8 PCI Unit Error Registers
16.7 Overwrite Policy
16.7.1 AFAR Overwrite Policy
16.7.2 AFSR Parity Syndrome (P_SYND) Overwrite Policy
16.7.3 AFSR E-cache Tag Parity (ETS) Overwrite Policy
16.7.4 SDB ECC Syndrome (E_SYND) Overwrite Policy

17. Reset and RED_state
17.1 Overview
17.2 Resets
17.2.1 Power-on Reset (POR) and Initialization
17.2.2 Externally Initiated Reset (XIR)
17.2.3 Watchdog Reset (WDR) and error_state
17.2.4 Software-Initiated Reset (SIR)
17.2.5 Hardware Reset Sources
17.2.6 Software Reset
17.2.7 Effects of Resets
17.3 RED_state
17.3.1 Description of RED_state
17.3.2 RED_state Trap Vector
17.4 Machine State after Reset and in RED_state

18. MCU Control and Status Registers
18.1 FFB_Config Register (0x1FE.0000.F000)
18.2 Mem_Control0 Register (0x1FE.0000.F010)
18.3 Mem_Control1 Register (0x1FE.0000.F018)
18.4 Programming Mem_Control1
18.5 UPA Configuration Register

19. UltraSPARC-IIi PCI Control and Status
19.1 Terms and Abbreviations Used
19.2 Access Restrictions
19.3 PCI Bus Module Registers
19.3.1 PCI Configuration Space
19.3.2 IOMMU Registers
19.3.3 Interrupt Registers
19.3.4 PCI INT_ACK Generation
19.4 PCI Address Space
19.4.1 PCI Address Space—PIO
19.4.2 PCI Address Space—DMA
19.4.3 DMA Error Registers

20. SPARC-V9 Memory Models
20.1 Overview
20.2 Supported Memory Models
20.2.1 TSO
20.2.2 PSO
20.2.3 RMO

21. Code Generation Guidelines
21.1 Hardware / Software Synergy
21.2 Instruction Stream Issues
21.2.1 UltraSPARC-IIi Front End
21.2.2 Instruction Alignment
21.2.3 I-cache Timing
21.2.4 Executing Code Out of the E-cache
21.2.5 uTLB and iTLB Misses
21.2.6 Branch Prediction
21.2.7 I-cache Utilization
21.2.8 Handling of CTI couples
21.2.9 Mispredicted Branches
21.2.10 Return Address Stack (RAS)
21.3 Data Stream Issues
21.3.1 D-cache Organization
21.3.2 D-cache Timing
21.3.3 Data Alignment
21.3.4 Direct-Mapped Cache Considerations
21.3.5 D-cache Miss, E-cache Hit Timing
21.3.6 Scheduling for the E-cache
21.3.7 Store Buffer Considerations
21.3.8 Read-After-Write and Write-After-Read Hazards
21.3.9 Non-Faulting Loads

22. Grouping Rules and Stalls
22.1 Introduction
22.1.1 Textual Conventions
22.1.2 Example Conventions
22.2 General Grouping Rules
22.3 Instruction Availability
22.4 Single Group Instructions
22.5 Integer Execution Unit (IEU) Instructions
22.5.1 Multi-Cycle IEU Instructions
22.5.2 IEU Dependencies
22.6 Control Transfer Instructions
22.6.1 Control Transfer Dependencies
22.7 Load / Store Instructions
22.7.1 Load Dependencies and Interaction with Cache Hierarchy
22.7.2 Store Dependencies
22.8 Floating-Point and Graphic Instructions
22.8.1 Floating-Point and Graphics Instruction Dependencies
22.8.2 Floating-Point and Graphics Instruction Latencies

A. Debug and Diagnostics Support
A.1 Overview
A.2 Diagnostics Control and Accesses
A.3 Dispatch Control Register
A.4 Floating-Point Control
A.5 Watchpoint Support
A.5.1 Instruction Breakpoint
A.5.2 Data Watchpoint
A.5.3 Virtual Address (VA) Data Watchpoint Register
A.5.4 Physical Address Data Watchpoint Register
A.6 LSU_Control_Register
A.6.1 Cache Control
A.6.2 MMU Control
A.6.3 Parity Control
A.6.4 Watchpoint Control
A.7 I-cache Diagnostic Accesses
A.7.1 I-cache Instruction Fields
A.7.2 I-cache Tag/Valid Fields
A.7.3 I-cache Predecode Field
A.7.4 I-cache LRU/BRPD/SP/NFA Fields
A.8 D-cache Diagnostic Accesses
A.8.1 D-cache Data Field
A.8.2 D-cache Tag/Valid Fields
A.9 E-cache Diagnostics Accesses
A.9.1 E-cache Data Fields
A.9.2 E-cache Tag/State/Parity Field Diagnostic Accesses
A.9.3 E-cache Tag/State/Parity Data Accesses
A.10 Memory Probing and Initialization
A.10.1 Initialization
A.10.2 Memory Probing
A.10.3 Detection of DIMM presence
A.10.4 Determination of DIMM pair Size
A.10.5 Determination of DIMM pair size equivalence
A.10.6 11-bit Column Address Mode
A.10.7 Banked DIMMs
A.10.8 Completion of probing

B. Performance Instrumentation
B.1 Overview
B.2 Performance Control and Counters
B.3 PCR/PIC Accesses
B.4 Performance Instrumentation Counter Events
B.4.1 Instruction Execution Rates
B.4.2 Grouping (G) Stage Stall Counts
B.4.3 Load Use Stall Counts
B.4.4 Cache Access Statistics
B.4.5 PCR.S0 and PCR.S1 Encoding

C. IEEE 1149.1 Scan Interface
C.1 Introduction
C.2 Interface
C.3 Test Access Port Controller
C.3.1 TEST-LOGIC-RESET
C.3.2 RUN-TEST/IDLE
C.3.3 SELECT-DR-SCAN
C.3.4 SELECT-IR-SCAN
C.3.5 CAPTURE IR/DR
C.3.6 SHIFT IR/DR
C.3.7 EXIT-1 IR/DR
C.3.8 PAUSE IR/DR
C.3.9 EXIT-2 IR/DR
C.3.10 UPDATE IR/DR
C.4 Instruction Register
C.5 Instructions
C.5.1 Public Instructions
C.5.2 Private Instructions
C.6 Public Test Data Registers
C.6.1 Device ID Register
C.6.2 Bypass Register
C.6.3 Boundary Scan Register
C.6.4 Private Data Registers

D. ECC Specification
D.1 ECC Code

E. UPA64S interface
E.1 UPA64S Bus
E.1.1 Data Bus (MEMDATA)
E.1.2 SYSADDR Bus
E.2 UPA64S Transaction Overview
E.2.1 NonCachedRead (P_NCRD_REQ)
E.2.2 NonCachedBlockRead (P_NCBRD_REQ)
E.2.3 NonCachedWrite (P_NCWR_REQ)
E.2.4 NonCachedBlockWrite (P_NCBWR_REQ)
E.3 P_REPLY and S_REPLY
E.3.1 P_REPLY
E.3.2 S_REPLY
E.3.3 P_REPLY and S_REPLY Timing
E.4 Issues with Multiple Outstanding Transactions
E.4.1 Strong Ordering
E.4.2 Limiting the Number of Transactions
E.4.3 S_REPLY assertion
E.5 UPA64S Packet Formats
E.5.1 Request Packets
E.5.2 Packet Description

F. Pin and Signal Descriptions
F.1 Introduction
F.2 Pin Interface Signal Descriptions
F.2.1 External Cache (E-cache) Interface
F.2.2 Internal, SRAM, and UPA Clock Interface
F.2.3 PCI Clock Interface
F.2.4 JTAG/Debug Interface
F.2.5 Initialization Interface
F.2.6 PCI interface
F.2.7 Interrupt Interface
F.2.8 Memory and Transceiver Interface
F.2.9 UPA64S Interface

G. ASI Names
G.1 Introduction

H. Event Ordering on UltraSPARC-IIi
H.1 Highlight of US-IIi specific issues
H.2 Review of SPARC V9 load/store ordering
H.2.1 Ordering load/store Activity Out To The Primary PCI bus

I. Observability Bus
I.1 Theory of Operation
I.1.1 Muxing
I.1.2 Dispatch Control Register
I.1.3 Timing
I.1.4 Signal List
I.1.5 Other UltraSPARC-IIi Debug Features

J. List of Compatibility Notes

K. Errata
K.1 Overview
K.2 Errata Created by UltraSPARC-I
K.3 Errata created by UltraSPARC-IIi

Glossary
Bibliography
Index

Figures
FIGURE 1-1 UltraSPARC-IIi Block Diagram
FIGURE 1-2 UltraSPARC-IIi PCI and MCU Subsystems
FIGURE 1-3 UltraSPARC-IIi Memory—Typical Configuration
FIGURE 2-1 UltraSPARC-IIi Pipeline Stages (Simplified)
FIGURE 2-2 UltraSPARC-IIi Pipeline Stages (Detail)
FIGURE 4-1 Virtual-to-physical Address Translation for all Page Sizes
FIGURE 4-2 UltraSPARC-IIi 44-bit Virtual Address Space, with Hole (Same as FIGURE 14-2 on page 184)
FIGURE 4-3 Software View of the UltraSPARC-IIi MMU
FIGURE 5-1 Overview of UltraSPARC-IIi Reference Platform
FIGURE 5-2 A Typical Subsystem: UltraSPARC-IIi and Memory—Simplified Block Diagram
FIGURE 5-3 UltraSPARC-IIi System Implementation Example
FIGURE 7-1 Memory RAS Wiring with 10-bit Column, 8-128 MB DIMM
FIGURE 7-2 Memory RAS Wiring with 11-bit Column, 8-256 MB DIMM
FIGURE 7-3 UltraSPARC-IIi Memory Addressing for 10-bit Column Address Mode
FIGURE 7-4 UltraSPARC-IIi Memory Addressing for 11-bit Column Address Mode
FIGURE 9-1 UltraSPARC-IIi Byte Twisting
FIGURE 10-1 IOM Top Level Block Diagram
FIGURE 10-2 TLB CAM Tag Format
FIGURE 10-3 TLB RAM Data Format
FIGURE 10-4 Virtual to Physical Address Translation for 8K Page Size
FIGURE 10-5 Virtual to Physical Address Translation for 64K Page Size
FIGURE 10-6 Physical Address Formation in Bypass Mode (8K and 64K)
FIGURE 10-7 Physical Address Formation in Pass-through Mode (8K and 64K)
FIGURE 10-8 Computation of TTE Entry Address
FIGURE 11-1 Mondo Vector Format
FIGURE 11-2 Full INR Contents
FIGURE 11-3 Partial INR Contents
FIGURE 11-4 Interrupt Concentrator
FIGURE 13-1 Graphics Fixed Data Formats
FIGURE 13-2 RDASR Format
FIGURE 13-3 WRASR Format
FIGURE 13-4 GSR Format (ASR 10₁₆)
FIGURE 13-5 Graphics Instruction Format (3)
FIGURE 13-6 Partitioned Add/Subtract Instruction Format (3)
FIGURE 13-7 Pixel Formatting Instruction Format (3)
FIGURE 13-8 FPACK16 Operation
FIGURE 13-9 FPACK32 Operation
FIGURE 13-10 FPACKFIX Operation
FIGURE 13-11 FEXPAND Operation
FIGURE 13-12 FPMERGE Operation
FIGURE 13-13 Partitioned Multiply Instruction Format (3)
FIGURE 13-14 FMUL8x16 Operation
FIGURE 13-15 FMUL8x16AU Operation
FIGURE 13-16 FMUL8x16AL Operation
FIGURE 13-17 FMUL8SUx16 Operation
FIGURE 13-18 FMUL8ULx16 Operation
FIGURE 13-19 FMULD8SUx16 Operation
FIGURE 13-20 FMULD8ULx16 Operation
FIGURE 13-21 Alignment Instruction Format (3)
FIGURE 13-22 Logical Operate Instruction Format (3)
FIGURE 13-23 Pixel Compare Instruction Format (3)
FIGURE 13-24 Edge Handling Instruction Format (3)
FIGURE 13-25 Pixel Component Distance Format (3)
FIGURE 13-26 Three-Dimensional Array Addressing Instruction Format (3)
FIGURE 13-27 Three Dimensional Array Fixed-Point Address Format
FIGURE 13-28 Three Dimensional Array Blocked-Address Format (Array8)
FIGURE 13-29 Three Dimensional Array Blocked-Address Format (Array16)
FIGURE 13-30 Three Dimensional Array Blocked-Address Format (Array32)
FIGURE 13-31 Partial Store Format (3)
FIGURE 13-32 Format (3) LDDFA
FIGURE 13-33 Format (3) STDFA
FIGURE 13-34 Format (3) LDDA
FIGURE 13-35 SHUTDOWN Instruction Format (3)
FIGURE 14-1 Nested Trap Levels
FIGURE 14-2 UltraSPARC-IIi’s 44-bit Virtual Address Space, with Hole (Same as FIGURE 4-2 on page 25)
FIGURE 15-1 Translation Table Entry (TTE) (from TSB)
FIGURE 15-2 TSB Organization
FIGURE 15-3 MMU Tag Target Registers (Two Registers)
FIGURE 15-4 D-MMU Primary Context Register
FIGURE 15-5 D-MMU Secondary Context Register
FIGURE 15-6 D-MMU Nucleus Context Register
FIGURE 15-7 I- and D-MMU Synchronous Fault Status Register Format
FIGURE 15-8 D-MMU Synchronous Fault Address Register (SFAR) Format
FIGURE 15-9 I-/D-TSB Register Format
FIGURE 15-10 I/D MMU TLB Tag Access Registers
FIGURE 15-11 I-/D-MMU TSB 8 kB/64 kB Pointer and D-MMU Direct Pointer Register
FIGURE 15-12 MMU I-/D-TLB Data In/Access Registers
FIGURE 15-13 MMU TLB Data Access Address, in Alternate Space
FIGURE 15-14 I-/D-MMU TLB Tag Read Registers
FIGURE 15-15 MMU Demap Operation Format
FIGURE 15-16 Formation of TSB Pointers for 8 kB and 64 kB TTEs
FIGURE 17-1 Reset Block Diagram
FIGURE 18-1 UPA_CONFIG Register Format
FIGURE 19-1 Interrupt Vector Data Registers Contents
FIGURE 19-2 Type 0 Configuration Address Mapping
FIGURE 19-3 Type 1 Configuration Address Mapping
FIGURE 21-1 I-cache Organization
FIGURE 21-2 Odd Fetch to an I-cache Line
FIGURE 21-3 Next Field Aliasing Between Two Branches
FIGURE 21-4 Aliasing of Prediction Bits in a Rare CTI Couple Case
FIGURE 21-5 Artificial Branch Inserted after a 32-byte Boundary
FIGURE 21-6 Dynamic Branch Prediction State Diagram
FIGURE 21-7 Handling of Conditional Branches
FIGURE 21-8 Handling of MOVCC
FIGURE 21-9 Cost of a Mispredicted Branch (Shaded Area)
FIGURE 21-10 Branch Transformation to Reduce Mispredicted Branches
FIGURE 21-11 Logical Organization of D-cache
FIGURE A-1 VA Data Watchpoint Register Format (ASI 58₁₆, VA=38₁₆)
FIGURE A-2 PA Data Watchpoint Register Format (ASI 58₁₆, VA=40₁₆)
FIGURE A-3 LSU_Control_Register Access Data Format (ASI 45₁₆)
FIGURE A-4 Simplified I-cache Organization (Only 1 Set Shown)
FIGURE A-5 I-cache Instruction Access Address Format (ASI 66₁₆)
FIGURE A-6 I-cache Instruction Access Data Format (ASI 66₁₆)
FIGURE A-7 I-cache Tag/Valid Access Address Format (ASI 67₁₆)
FIGURE A-8 I-cache Tag/Valid Field Data Format (ASI 67₁₆)
FIGURE A-9 I-cache Predecode Field Access Address Format (ASI 6E₁₆)
FIGURE A-10 I-cache Predecode Field LDDA Access Data Format (ASI 6E₁₆)
FIGURE A-11 I-cache Predecode Field STXA Access Data Format (ASI 6E₁₆)
FIGURE A-12 I-cache LRU/BRPD/SP/NFA Field Access Address Format (ASI 6F₁₆)
FIGURE A-13 I-cache LRU/BRPD/SP/NFA Field LDDA Access Data Format (ASI 6F₁₆)
FIGURE A-14 Dynamic Branch Prediction State Diagram
FIGURE A-15 D-cache Data Access Address Format (ASI 46₁₆)
FIGURE A-16 D-cache Data Access Data Format (ASI 46₁₆)
FIGURE A-17 D-cache Tag/Valid Access Address Format (ASI 47₁₆)
FIGURE A-18 D-cache Tag/Valid Access Data Format (ASI 47₁₆)
FIGURE A-19 E-cache Data Access Address Format
FIGURE A-20 E-cache Data Access Data Format
FIGURE A-21 E-cache Tag Access Address Format
FIGURE A-22 E-cache Tag Access Data Format
FIGURE B-1 Performance Control Register (PCR)
FIGURE B-2 Performance Instrumentation Counters (PIC)
FIGURE B-3 PCR/PIC Operational Flow
FIGURE C-1 Device ID Register
FIGURE E-1 Data Byte Addresses Within a Dword
FIGURE E-2 S_REPLY Timing: UPA64S device Sourcing Block
FIGURE E-3 S_REPLY Timing: UPA64S device Sinking Block
FIGURE E-4 P_REPLY to S_REPLY Timing
FIGURE E-5 S_REPLY Pipelining
FIGURE E-6 Packet Format: Noncached P_REQ Transactions
FIGURE E-7 UPA64s Transactions Flowchart—Address Bus
FIGURE E-8 UPA64s Transactions Flowchart—Data Bus
FIGURE I-1 Dispatch Control Register (ASR 0x18)
FIGURE I-2 Diagram of Observability Bus Logic

Tables
TABLE 1-1 Supported Trap Levels
TABLE 6-1 UltraSPARC-IIi Address Map
TABLE 6-2 Physical address space to PCI space
TABLE 6-3 Additional Internal UltraSPARC-IIi CSR space (noncacheable)
TABLE 6-4 Mandatory SPARC-V9 ASIs
TABLE 6-5 UltraSPARC-IIi Extended (non-SPARC-V9) ASIs
TABLE 6-6 CSRs Mapped to Non-cacheable Address Space
TABLE 6-7 Mandatory SPARC-V9 ASRs
TABLE 6-8 Suggested Assembler Syntax for Mandatory ASRs
TABLE 6-9 Non-SPARC-V9 ASRs
TABLE 6-10 Suggested Assembler Syntax for Non-SPARC V9 ASRs
TABLE 6-11 Other UltraSPARC-IIi Registers
TABLE 6-12 Traps Supported in UltraSPARC-IIi
TABLE 7-1 PA[29:27] to RASX_L Mapping for 10-bit Column Address Mode
TABLE 7-2 Memory Address Map for 10-bit Column Address Mode
TABLE 7-3 PA[29:28] to RASX_L Mapping for 11-bit Column Address Mode
TABLE 7-4 Memory Address Map for 11-bit Column Address Mode
TABLE 8-1 ASIs that Support SWAP, LDSTUB, and CAS
TABLE 8-2 PREFETCH{A} Variants
TABLE 9-1 PCI Command Generation
TABLE 9-2 PCI Command Response
TABLE 10-1 Description of TLB Tag Fields
TABLE 10-2 TLB Data Format
TABLE 10-3 PCI DMA Modes of Operation
TABLE 10-4 TTE Data Format
TABLE 10-5 Offset to TSB Table
TABLE 11-1 Interrupt Receiver State Register
TABLE 11-2 INT Code Assignments for Edge-sensitive Interrupts
TABLE 11-3 Interrupt State Transition Table
TABLE 11-4 Summary of Interrupts
TABLE 11-5 Outgoing Interrupt Vector Data Register Format
TABLE 11-6 Interrupt Dispatch Status Register Format
TABLE 11-7 Incoming Interrupt Vector Data Register Format
TABLE 11-8 Interrupt Vector Receive Register Format
TABLE 11-9 SOFTINT Register Format
TABLE 11-10 SOFTINT ASRs
TABLE 12-1 Complete UltraSPARC-IIi Instruction Set
TABLE 13-1 Graphics Status Register Opcodes
TABLE 13-2 GSR Instruction Syntax
TABLE 13-3 Partitioned Add/Subtract Instruction Opcodes
TABLE 13-4 Partitioned Add/Subtract Instruction Syntax
TABLE 13-5 Pixel Formatting Instruction Opcode Format
TABLE 13-6 Pixel Formatting Instruction Syntax
TABLE 13-7 Partitioned Multiply Instruction Opcodes
TABLE 13-8 Partitioned Multiply Instruction Syntax
TABLE 13-9 Alignment Instruction Opcodes
TABLE 13-10 Alignment Instruction Syntax
UltraSPARC-IIi User’s Manual • Draft B
TABLE 13-11 Logical Operate Instructions
TABLE 13-12 Logical Operate Instruction Syntax
TABLE 13-13 Pixel Compare Instruction Opcodes
TABLE 13-14 Pixel Compare Instruction Syntax
TABLE 13-15 Edge Handling Instruction Opcodes
TABLE 13-16 Edge Handling Instruction Syntax
TABLE 13-17 Edge Mask Specification
TABLE 13-18 Edge Mask Specification (Little-Endian)
TABLE 13-19 Pixel Component Distance Opcode
TABLE 13-20 Pixel Component Distance Syntax
TABLE 13-21 Three-Dimensional Array Addressing Instruction Opcodes
TABLE 13-22 Three-Dimensional Array Addressing Instruction Syntax
TABLE 13-23 Allowable values for rs2
TABLE 13-24 Partial Store Opcodes
TABLE 13-25 Partial Store Syntax
TABLE 13-27 Format (3) LDDFA
TABLE 13-28 Format (3) STDFA
TABLE 13-26 Short Floating-Point Load and Store Instruction
TABLE 13-29 Short Floating-Point Load and Store Instruction Syntax
TABLE 13-30 Block Load and Store Instruction Opcodes
TABLE 13-31 Block Load and Store Instruction Syntax
TABLE 13-32 Atomic Quad Load Opcodes
TABLE 13-33 Atomic Quad Load Syntax
TABLE 13-34 SHUTDOWN Opcode
TABLE 13-35 SHUTDOWN Syntax
TABLE 14-1 TICK Register Format
TABLE 14-2 Version Register Format
TABLE 14-3 VER.impl Values by UltraSPARC-IIi Model
TABLE 14-4  Subnormal Operand Trapping Cases (NS=0)
TABLE 14-5  Subnormal Result Trapping Cases (NS=0)
TABLE 14-6  Unimplemented Quad-Precision Floating-Point Instructions
TABLE 14-7  Floating-Point Status Register Format
TABLE 14-8  Floating-Point Rounding Modes
TABLE 14-9  Floating-Point Trap Type Values
TABLE 14-10  PREFETCH{A} Variants (UltraSPARC-II)
TABLE 14-11  TICK_compare Register Format
TABLE 14-12  Extended PSTATE Register
TABLE 14-13  PSTATE Global Register Selection Encoding
TABLE 15-1  Size Field Encoding (from TTE)
TABLE 15-2  Cacheable Field Encoding (from TSB)
TABLE 15-3  MMU Traps
TABLE 15-4  Abbreviations for MMU Behavior
TABLE 15-5  Abbreviations for ASI Types
TABLE 15-6  D-MMU Operations for Normal ASIs
TABLE 15-7  I-MMU Operations for Normal ASIs
TABLE 15-8  ASI Mapping for Instruction Accesses
TABLE 15-9  ASI Mapping for Data Accesses
TABLE 15-10  I-MMU and D-MMU Context Register Usage
TABLE 15-11  MMU Compliance w/SPARC-V9 Annex F Protection Mode
TABLE 15-12  UltraSPARC-IIi MMU Internal Registers and ASI Operations
TABLE 15-13  MMU Synchronous Fault Status Register FT (Fault Type) Field
TABLE 15-14  MMU SFSR Context ID Field Description
TABLE 15-15  Effect of Loads and Stores on MMU Registers
TABLE 15-16  MMU Demap Operation Type Field Description
TABLE 15-17  MMU Demap Operation Context Field Description
TABLE 15-18  Physical Page Attribute Bits for MMU Bypass Mode
UltraSPARC-IIi User’s Manual • Draft B
TABLE 16-1  Summary of Error Reporting
TABLE 16-2  E-cache Error Enable Register Format
TABLE 16-3  Asynchronous Fault Status Register
TABLE 16-4  E-cache Data Parity Syndrome Bit Orderings
TABLE 16-5  E-cache Tag Parity Syndrome Bit Orderings
TABLE 16-6  Asynchronous Fault Address Register
TABLE 16-7  Error Detection and Reporting in AFAR and AFSR
TABLE 16-8  SDBH Error Register Format
TABLE 16-9  SDBH Control Register Format
TABLE 17-1  Effects of Resets
TABLE 17-2  Reset_Control Register
TABLE 17-3  Machine State After Reset and in RED_state
TABLE 18-1  MCU CSRs
TABLE 18-2  FFB_Config Register
TABLE 18-3  Mem_Control0 Register
TABLE 18-4  DIMMPairPresent Encoding
TABLE 18-5  Various Memory Configurations
TABLE 18-6  Refresh Period (in 32XCPU clock periods) as a Function of Frequency
TABLE 18-7  Mem_Control1 Register
TABLE 18-8  AMDC Arguments and Timing
TABLE 18-9  ARDC Timing Arguments
TABLE 18-10  CSR Delay Timing
TABLE 18-11  CASRW Assertion Time
TABLE 18-12  RCD Delay
TABLE 18-13  CP – CAS Precharge Time
TABLE 18-14  RP Timing
TABLE 18-15  RAS Duration Time
TABLE 18-16  RSC – RAS Deassert Time
TABLE 18-17  Mem_Control1 Values as a Function of CPU Frequency
TABLE 19-1  PBM Registers
TABLE 19-2  PCI Control and Status Register
TABLE 19-3  PCI PIO Write AFSR
TABLE 19-4  PCI PIO Write AFAR
TABLE 19-5  PCI Diagnostic Register
TABLE 19-6  PCI Target Address Space Register
TABLE 19-7  PCI DMA Write Synchronization Register
TABLE 19-8  PIO Data Buffer Diagnostics Access
TABLE 19-9  DMA Data Buffer Diagnostics Access
TABLE 19-10  DMA Data Buffer Diagnostics Access (72:64)
TABLE 19-11  PBM PCI Configuration Space
TABLE 19-12  Configuration Space Header Summary
TABLE 19-13  Command Register
TABLE 19-14  Status Register
TABLE 19-15  Latency Timer Register
TABLE 19-16  Header Type Register
TABLE 19-17  Bus Number Register
TABLE 19-18  Subordinate Bus Number Register
TABLE 19-19  IOMMU Registers
TABLE 19-20  IOMMU Control Register
TABLE 19-21  Address Space Size and Base Address Determination
TABLE 19-22  IOMMU TSB Base Address Register
TABLE 19-23  Flush Address Register
TABLE 19-24  IOMMU Tag Diagnostics Access
TABLE 19-25  IOMMU Data RAM Diagnostics Access
TABLE 19-26  Virtual Address Diagnostic Register
TABLE 19-27  IOMMU Tag Comparator Diagnostics Access
TABLE 19-28  Interrupt Number Offset Assignments
TABLE 19-29  Partial Interrupt Mapping Registers
TABLE 19-30  Format of Partial Interrupt Mapping Registers
TABLE 19-31  Full Interrupt Mapping Registers
TABLE 19-32  Format of Full Interrupt Mapping Registers
TABLE 19-33  Clear Interrupt Pseudo Registers
TABLE 19-34  Clear Interrupt Register
TABLE 19-35  Interrupt State Diagnostic Registers
TABLE 19-36  Level Interrupt State Assignment
TABLE 19-37  Pulse Interrupt State Assignment
TABLE 19-38  PCI Interrupt State Diagnostic Register Definition
TABLE 19-39  OBIO and Misc Int Diag Reg Definition
TABLE 19-40  PCI INT_ACK Register Format
TABLE 19-41  Physical Address Space to PCI Space Mappings
TABLE 19-42  PCI DMA Modes of Operation
TABLE 19-43  DMA Error Registers
TABLE 19-44  DMA UE AFSR
TABLE 19-45  DMA UE/CE AFAR
TABLE 19-46  DMA CE AFSR
TABLE 21-1  D-cache Miss, E-cache Hit Latency Depends on SRAM Mode
TABLE 22-1  Abbreviations Used in TABLE 22-2
TABLE 22-2  Latencies for Floating-Point and Graphics Instructions
TABLE A-1  ASIs Affected by Watchpoint Traps
TABLE A-2  LSU Control Register: Parity Mask Examples
TABLE A-3  LSU Control Register: VA/PA Data Watchpoint Byte Mask Examples
TABLE B-1  PIC.S0 Selection Bit Field Encoding
TABLE B-2  PIC.S1 Selection Bit Field Encoding
TABLE C-1  IEEE 1149.1 Signals
TABLE C-2  TAP Controller State Diagram
TABLE C-3  Instruction Register Behavior
TABLE C-4  IEEE 1149.1 Instruction Encodings
TABLE D-1  Syndrome Table for ECC SEC/S4ED Code
TABLE E-1  P_REPLY Type Definitions
TABLE E-2  P_REPLY Encoding
TABLE E-3  S_REPLY Type Definitions
TABLE E-4  S_REPLY Encoding
TABLE E-5  Transaction Type Encoding
TABLE F-1  Pin Reference - External Cache (E-cache) Interface
TABLE F-2  Pin Reference - Internal, SRAM, and UPA Clock Interface
TABLE F-3  Pin Reference - PCI Clock Interface
TABLE F-4  Pin Reference - JTAG/Debug Interface
TABLE F-5  Pin Reference - Initialization Interface
TABLE F-6  Pin Reference - PCI Interface
TABLE F-7  Pin Reference - Interrupt Interface
TABLE F-8  Pin Reference - Memory and Transceiver Interface
TABLE F-9  Pin Reference - UPA64S Interface
TABLE G-1  ASI Names (listed alphabetically)
TABLE I-1  Group Select Bits
Preface
Overview
Welcome to the UltraSPARC-IIi User’s Manual. This book contains information about the architecture and programming of UltraSPARC-IIi, one of Sun Microsystems’ family of SPARC-V9-compliant processors, which also meets the requirements of the PCI specification, version 2.1. This manual describes the UltraSPARC-IIi processor implementation and contains information on:
• The UltraSPARC-IIi system architecture
• The components that make up an UltraSPARC-IIi processor
• Memory and low-level system management, including detailed information needed by operating system programmers
• Extensions to and implementation-dependencies of the SPARC-V9 architecture
• Techniques for managing the pipeline and for producing optimized code
• Instruction set, instruction grouping rules for efficient execution, address space identifiers, and event ordering
• Data and address formats
• External interfaces and their support, including PCI, memory, and UPA64S
• Interrupts and traps
• Memory models
• Debug and diagnostic provisions, including performance instrumentation
• Power management
• Performance instrumentation and Boundary Scan (IEEE 1149) support
• Compatibility considerations with regard to prior processors
A Brief History of SPARC and PCI
SPARC stands for Scalable Processor ARChitecture, which was first announced in 1987. Unlike more traditional processor architectures, SPARC is an open standard, freely available through license from SPARC International, Inc. Any company that obtains a license can manufacture and sell a SPARC-compliant processor. By the early 1990s SPARC processors were available from over a dozen different vendors, and over 8,000 SPARC-compliant applications had been certified. In 1994, SPARC International, Inc. published The SPARC Architecture Manual, Version 9, which defined a powerful 64-bit enhancement to the SPARC architecture. SPARC-V9 provided support for:
• 64-bit virtual addresses and 64-bit integer data
• Fault tolerance
• Fast trap handling and context switching
• Big- and little-endian byte orders
UltraSPARC is the first family of SPARC-V9-compliant processors available from Sun Microsystems, Inc.
The Peripheral Component Interconnect (PCI) bus specification was first issued in June 1992 (at version 1.0) by the PCI Special Interest Group to define a high-performance bus for peripheral components. In 1993 the group added a connector specification. The current version 2.1 document, released in June 1995, added a 66 MHz bus specification. The PCI Local Bus uses multiplexed address and data lines and is well suited for connecting high-bandwidth peripheral components. It is used to interconnect highly integrated peripheral-controller components, peripheral add-in boards, and processor and memory systems, and offers the following advantages:
• Peripheral compatibility with existing drivers and application software
• 32-bit or 64-bit data bus width and 64-bit addressing are supported
• Synchronous peripheral bus
• Processor-independent bus optimized for I/O functions
• Bus operation concurrent with processor subsystem
• Peripheral access from anywhere in memory or I/O space
• Peripheral latency minimized by efficient coupling with processor bus, cache, and memory
• 33 and 66 MHz bus clock specification
• PCI peripherals contain registers with information for their configuration
UltraSPARC-IIi User’s Manual • October 1997
Sun provides the optional Advanced PCI Bridge (APB™) ASIC for an optimized PCI interface with the UltraSPARC-IIi processor.
How to Use This Book
This book is a companion to The SPARC Architecture Manual, Version 9, which is available from many technical bookstores or directly from its copyright holder:
SPARC International, Inc.
535 Middlefield Road, Suite 210
Menlo Park, CA 94025
(415) 321-8692
The SPARC Architecture Manual, Version 9 provides a complete description of the SPARC-V9 architecture. Since SPARC-V9 is an open architecture, many of the implementation decisions have been left to the manufacturers of SPARC-compliant processors. These “implementation dependencies” are introduced in The SPARC Architecture Manual, Version 9.
This book, the UltraSPARC-IIi User’s Manual, describes the UltraSPARC-IIi implementation of the SPARC-V9 architecture. It provides specific information about UltraSPARC-IIi processors, including how each SPARC-V9 implementation dependency was resolved. (See Chapter 14, “Implementation Dependencies,” for specific information.) This manual also describes extensions to SPARC-V9 that are available (currently) only on UltraSPARC-IIi processors.
A great deal of background information and a number of architectural concepts are not contained in this book. You will find cross references to The SPARC Architecture Manual, Version 9 throughout this book. You should have a copy of that book at hand whenever you are working with the UltraSPARC-IIi User’s Manual.
For detailed information about the electrical and mechanical characteristics of the processor, including pin and pad assignments, consult the UltraSPARC-IIi Data Sheet. The section “Bibliography” on page 485 describes how to obtain the data sheet.
Textual Conventions
This book uses the same textual conventions as The SPARC Architecture Manual, Version 9. They are summarized here for convenience. Fonts are used as follows:
• Italic font is used for register names, instruction fields, and read-only register fields.
• courier font is used for literals and software examples.
• Bold font is used for emphasis.
• UPPER CASE items are acronyms, instruction names, or writable register fields.
• Italic sans serif font is used for exception and trap names.
• Underbar characters (_) join words in register, register field, exception, and trap names. Such words can be split across lines at the underbar without an intervening hyphen.
The following notational conventions are used:
• Square brackets ‘[ ]’ indicate a numbered register in a register file.
• Angle brackets ‘< >’ indicate a bit number or colon-separated range of bit numbers within a field.
• Curly braces ‘{ }’ are used to indicate textual substitution.
• The symbol □ designates concatenation of bit vectors.
• A comma ‘,’ on the left side of an assignment separates quantities that are concatenated for the purpose of assignment.
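The bit-range and concatenation notations can be illustrated with a short Python sketch. The helper names bits and concat are hypothetical and exist only for this illustration; they are not part of any SPARC tool or library.

```python
def bits(value, hi, lo=None):
    """Return the field value<hi:lo>, or the single bit value<hi> if lo is omitted."""
    if lo is None:
        lo = hi
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)

def concat(*fields):
    """Concatenate (value, width) pairs; the leftmost pair is most significant."""
    result = 0
    for value, width in fields:
        result = (result << width) | (value & ((1 << width) - 1))
    return result

word = 0xDEADBEEF                           # arbitrary example value
print(hex(bits(word, 31, 16)))              # the field word<31:16>
print(bits(word, 0))                        # the single bit word<0>
print(hex(concat((0xDE, 8), (0xAD, 8))))    # two 8-bit vectors joined into 16 bits
```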
Contents
This manual has the following organization: The initial part of this book gives an overview of the UltraSPARC-IIi and contains the following chapters:
• Chapter 1, “UltraSPARC-IIi Basics,” describes the architecture in general terms and introduces its components.
• Chapter 2, “Processor Pipeline,” describes UltraSPARC-IIi’s 9-stage pipeline.
• Chapter 3, “Cache Organization,” describes the UltraSPARC-IIi caches.
• Chapter 4, “Overview of I and D-MMUs,” describes the UltraSPARC-IIi MMU, its architecture, how it performs virtual address translation, and how it is programmed.
• Chapter 5, “UltraSPARC-IIi in a System,” briefly describes the UltraSPARC-IIi configuration.
• Chapter 6, “Address Spaces, ASIs, ASRs, and Traps,” discusses physical and virtual address space mapping and identifiers. It lists address and port assignments, including those for PCI, and also gives memory DIMM requirements.
• Chapter 7, “UltraSPARC-IIi Memory System,” discusses DRAM memory hardware structure, selection, and addressing.
• Chapter 8, “Cache and Memory Interactions,” deals with the requirements to preserve data integrity during cache and memory operations and describes instructions used in these cases.
• Chapter 9, “PCI Bus Interface,” describes the PCI Bus Interface Module of UltraSPARC-IIi, which is a host PCI bridge.
• Chapter 10, “UltraSPARC-IIi IOM,” details the IO Memory Management Unit (IOM), which performs virtual to physical address translation.
• Chapter 11, “Interrupt Handling,” describes how UltraSPARC-IIi processes interrupts.
• Chapter 12, “Instruction Set Summary,” provides a list of all supported instructions, including SPARC-V9 core instructions and UltraSPARC-IIi extensions.
• Chapter 13, “VIS™ and Additional Instructions,” contains detailed documentation of the extended instructions that UltraSPARC-IIi adds to the SPARC-V9 instruction set, including those relating to power management, graphics, and memory access and control.
• Chapter 14, “Implementation Dependencies,” discusses how UltraSPARC-IIi resolves each of the implementation-dependencies defined by the SPARC-V9 architecture.
The latter part of the book presents detailed information about UltraSPARC-IIi architecture and programming. This section contains the following chapters:
• Chapter 15, “MMU Internal Architecture”
• Chapter 16, “Error Handling,” discusses how UltraSPARC-IIi handles system errors and describes the available error status registers.
• Chapter 17, “Reset and RED_state,” describes how UltraSPARC-IIi handles the various SPARC-V9 reset conditions, and how it implements RED_state.
• Chapter 18, “MCU Control and Status Registers”
• Chapter 19, “UltraSPARC-IIi PCI Control and Status”
• Chapter 20, “SPARC-V9 Memory Models,” describes the supported memory models (which are documented fully in The SPARC Architecture Manual, Version 9). Low-level programmers and operating system implementors should study this chapter to understand how their code will interact with the UltraSPARC-IIi cache and memory systems.
• Chapter 21, “Code Generation Guidelines,” contains detailed information about generating optimum UltraSPARC-IIi code.
• Chapter 22, “Grouping Rules and Stalls,” describes instruction interdependencies and optimal instruction ordering.
Appendices contain low-level technical material or information not needed for a general understanding of the architecture. The manual contains the following appendices:
• Appendix A, “Debug and Diagnostics Support,” describes diagnostics registers and capabilities.
• Appendix B, “Performance Instrumentation,” describes built-in capabilities to measure UltraSPARC-IIi performance.
• Appendix C, “IEEE 1149.1 Scan Interface,” contains information about the diagnostic boundary-scan interface for UltraSPARC-IIi.
• Appendix D, “ECC Specification,” details the specification for the error correcting code (ECC) used in transactions between processor and DRAMs.
• Appendix E, “UPA64S Interface,” describes transactions and data format on the MEMDATA bus.
• Appendix F, “Pin and Signal Descriptions,” contains general information about the pins and signals of the UltraSPARC-IIi and its components.
• Appendix G, “ASI Names,” contains an alphabetical listing of the names and suggested macro syntax for all supported ASIs.
• Appendix H, “Event Ordering on UltraSPARC-IIi,” discusses ordering of load and store operations.
• Appendix I, “Observability Bus,” describes this bus, which can help bring up the processor and provide performance monitoring.
• Appendix J, “List of Compatibility Notes,” provides a reference list of the compatibility notes from the various chapters of the text.
• Appendix K, “Errata,” lists errata for the UltraSPARC-IIi.
A Glossary, Bibliography, and Index complete the book.
CHAPTER 1
UltraSPARC-IIi Basics
1.1 Overview
UltraSPARC-IIi is a high-performance, highly integrated superscalar processor that implements the 64-bit SPARC-V9 RISC architecture and includes on-chip memory and I/O control. It supports Sun's popular Solaris operating system and is binary-compatible with all UltraSPARC software. Each functional area on the UltraSPARC-IIi maintains decentralized control, allowing many activities to overlap. The design supports the following features:
• Sustained issue of up to 4 instructions per cycle (even in the presence of conditional branches and cache misses) with a decoupled Prefetch and Dispatch Unit.
• Load buffers on the input side of the Execution Unit, together with store buffers on the output side, decouple pipeline execution from data cache misses.
• Instructions are issued in program order to multiple functional units.
• Instructions execute in parallel and may complete out of order.
• Instructions from two basic blocks (that is, instructions before and after a conditional branch) can be issued in the same group.
• Separate Memory Control and PCI I/O interface units also decouple their related key activities from the instruction pipeline.
UltraSPARC-IIi includes a full implementation of the 64-bit SPARC-V9 architecture. It supports a 44-bit virtual address space and a 41-bit physical address space with 64-bit address pointers. The core instruction set is extended to include the VIS instruction set: graphics instructions that provide the most common operations related to two-dimensional image processing, two- and three-dimensional graphics and image compression algorithms, and parallel operations on pixel data with 8- and 16-bit components. Support for high-bandwidth memory-to-memory transfers is also provided through 64-byte block load and block store instructions.
1.2 Design Philosophy
The execution time of an application is the product of three factors: the number of instructions generated by the compiler, the average number of cycles required per instruction, and the cycle time of the processor. The architecture and implementation of UltraSPARC-IIi, coupled with new compiler techniques, make it possible to reduce each factor without degrading the other two.
The number of instructions for a given task depends on the instruction set and on compiler optimizations (dead code elimination, constant propagation, profiling for code motion, and so on). Since it is based on the SPARC-V9 architecture, UltraSPARC-IIi offers features that can help reduce the total instruction count:
• 64-bit integer processing
• Additional floating-point registers (beyond the number offered in SPARC-V8) that can be used to eliminate floating-point loads and stores
• Enhanced trap model with alternate global registers
The average number of cycles per instruction (CPI) depends on the architecture of the processor and on the ability of the compiler to take advantage of the hardware features offered. The UltraSPARC-IIi execution units (ALUs, LD/ST, branch, two floating-point, and two graphics) allow the CPI to be as low as 0.25 (four instructions per cycle). To support this high execution bandwidth, sophisticated hardware is provided to supply:
1. Up to four instructions per cycle, even in the presence of conditional branches
2. Data at a rate of eight bytes per two cycles from the external cache to the data cache, and eight bytes per cycle into the register files
To reduce instruction dependency stalls, UltraSPARC-IIi has short latency operations and provides direct bypassing between units or within the same unit. The impact of cache misses, usually a large contributor to the CPI, is reduced significantly through the use of decoupled units (prefetch unit, load buffer, store buffer, and memory control) that operate asynchronously with the rest of the pipeline.
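The three-factor product for execution time can be made concrete with a short calculation. The Python sketch below is illustrative only: the instruction count, the 300 MHz clock, and the comparison CPI of 1.0 are assumed values chosen for the example, not measurements, and the function name is hypothetical.

```python
def execution_time_seconds(instruction_count, cycles_per_instruction, clock_hz):
    """Execution time = instructions x CPI x cycle time, where cycle time = 1 / clock."""
    cycle_time = 1.0 / clock_hz
    return instruction_count * cycles_per_instruction * cycle_time

# Assumed workload: 1.2 billion instructions on an assumed 300 MHz clock.
N = 1_200_000_000
CLOCK_HZ = 300e6

best_case = execution_time_seconds(N, 0.25, CLOCK_HZ)  # peak rate: four instructions per cycle
reference = execution_time_seconds(N, 1.0, CLOCK_HZ)   # assumed CPI of 1.0 for comparison

print(f"CPI 0.25: {best_case:.2f} s")
print(f"CPI 1.00: {reference:.2f} s")
```

Because the factors multiply, halving any one of them halves the run time only if the other two stay fixed, which is the balance the design philosophy describes.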
The Memory Control Unit (MCU) is responsible for DRAM and UPA64S control, which is synchronized with the processor clock. The DRAM interface is expanded from 64 + 8 ECC bits to 128 + 16 ECC bits by means of external data transceivers. This configuration maximizes the EDO CAS cycle rate. The MCU specification is wide enough to embrace all major vendors’ DRAM specifications.
Other features, such as a fully pipelined interface to the external cache (E-cache) and support for speculative loads, coupled with sophisticated compiler techniques such as software pipelining and cross-block scheduling, also reduce the CPI significantly.
The PCI Bus Module (PBM) provides a direct interface with a 32-bit PCI bus that meets PCI specification version 2.1. This module is internally linked with the External Cache Unit (ECU) and the IOM. The IO Memory Management Unit (IOM) manages virtual-to-physical memory address mapping using a 16-entry Translation Lookaside Buffer (TLB) in conjunction with a large Translation Storage Buffer (TSB) in memory. The PCI bus can run at 66 MHz or at 33 MHz. Up to four Advanced PCI Bridge (APB) ASICs may be used with the UltraSPARC-IIi, each of which can support up to two 33 MHz secondary PCI buses. PCI DMA transfers are cache-coherent.
A balanced architecture must be able to provide a low CPI without affecting the cycle time. Several of UltraSPARC-IIi’s architectural features, coupled with an aggressive implementation and state-of-the-art technology, make it possible to achieve a short cycle time (see TABLE 1-1). The pipeline is organized so that large scalarity (four), short latencies, and multiple bypasses do not affect the cycle time significantly.
1.3 Component Description
FIGURE 1-1 shows a block diagram that illustrates the components of the UltraSPARC-IIi processor. In a single-chip implementation, UltraSPARC-IIi integrates these components:
• Independently clocked (132 MHz internal, 66 or 33 MHz external) PCI interfaces, fully decoupled from the main CPU
• PCI bus module (PBM)
• PCI I/O memory management unit (IOM) with 16 entries for incoming I/O to physical mapping/protection
• External cache (E-cache) control unit (ECU)
• Memory controller unit (MCU), which operates both the 144-bit-wide DRAM subsystem and the UPA64S interface
• 16-Kilobyte instruction cache (I-cache)
• 16-Kilobyte data cache (D-cache)
• Prefetch, branch prediction and dispatch unit (PDU) containing grouping logic and an instruction buffer
• A 64-entry instruction translation lookaside buffer (iTLB) and a 64-entry data translation lookaside buffer (dTLB)
• Integer execution unit (IEU) with two arithmetic logic units (ALUs)
• Floating-point unit (FPU) with independent add, multiply and divide/square root sub-units
• Graphics unit (GRU) composed of two independent execution pipelines
• Load buffer and store buffer unit (LSU), decoupling data accesses from the pipeline
[Block diagram. Blocks shown: PCI Bus Module (PBM); I/O Memory Management Unit (IOM); External Cache Unit (ECU); Memory and UPA64S Control Unit (MCU); Instruction Cache (I-cache); Data Cache (D-cache); Prefetch and Dispatch Unit (PDU) with grouping logic and instruction buffer; Instruction Translation Lookaside Buffer (iTLB); Data Translation Lookaside Buffer (dTLB); integer register file; floating-point register file; Integer Execution Unit (IEU); Floating-Point Unit (FPU) with FP add, FP multiply, and FP divide; Graphics Unit (GRU); Load Store Unit (LSU) with load queue and store queue. External connections: PCI, external cache RAM, main memory and UPA64S bus.]
FIGURE 1-1 UltraSPARC-IIi Block Diagram
1.3.1 PCI Bus Module (PBM)
The PBM interfaces UltraSPARC-IIi directly with a 32-bit PCI bus that complies with the PCI specification, revision 2.1. The PCI bus runs at speeds up to 66 MHz, typically 33 or 66 MHz. The PBM is optimized for 16-, 32-, and 64-byte transfers, and can support up to four PCI bus masters. The module also queues pending interrupts received from the interrupt concentrator chip (RIC, SME2210) or a programmable logic device (PLD). The entire PCI address space is noncacheable for CPU references, but coherent DMA is supported: all writes to memory from PCI, and all reads from memory, are cache coherent. Interrupt handling is synchronized to the completion of all prior DMA writes. The PCI data path is illustrated in FIGURE 1-2.
[Diagram. Labels shown: External Cache Unit (ECU); Memory Control Unit (MCU); I/O Memory Management Unit and CSR (IOM); PCI Data Path (PDP); PCI Bus Module Control and Status Registers (CSR) and Arbiter; Interrupt/Error and Reset; PCI sub-system boundary; PCI. Signals shown: TTE address (32), I/O space address (32), physical address (41), PIO data, DMA data, handshaking.]
FIGURE 1-2 UltraSPARC-IIi PCI and MCU Subsystems
1.3.2 IO Memory Management Unit (IOM)
The IOM performs address translations from 32-bit DVMA to 34-bit physical addresses when UltraSPARC-IIi is a PCI target (when DVMA read/write access is required). The IOM uses a fully associative 16-entry TLB (translation lookaside buffer). In the case of a TLB miss, the IOM performs a single-level hardware tablewalk into the large TSB (translation storage buffer) in memory.
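The lookup sequence described above (TLB first, then a single-level walk into the TSB on a miss) can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions, not the hardware algorithm: the 8 KB page size, the dictionary standing in for the TSB, the oldest-first replacement policy, and the class name IommuModel are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SHIFT = 13      # assumption: 8 KB I/O pages
TLB_ENTRIES = 16     # from the text: 16-entry fully associative TLB

class IommuModel:
    def __init__(self, tsb):
        self.tsb = tsb              # stands in for the in-memory TSB: virtual page -> physical page
        self.tlb = OrderedDict()    # fully associative: any page may occupy any entry
        self.tablewalks = 0

    def translate(self, dvma):
        vpn = dvma >> PAGE_SHIFT
        offset = dvma & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.tlb:                 # TLB miss: single-level walk into the TSB
            self.tablewalks += 1
            if vpn not in self.tsb:
                raise LookupError(f"no translation for DVMA {dvma:#x}")
            if len(self.tlb) >= TLB_ENTRIES:
                self.tlb.popitem(last=False)    # assumed policy: evict the oldest entry
            self.tlb[vpn] = self.tsb[vpn]
        return (self.tlb[vpn] << PAGE_SHIFT) | offset

tsb = {0x10: 0x2ABCD, 0x11: 0x00042}            # made-up mappings for the example
iommu = IommuModel(tsb)
pa1 = iommu.translate((0x10 << PAGE_SHIFT) | 0x123)   # miss, then fill from the TSB
pa2 = iommu.translate((0x10 << PAGE_SHIFT) | 0x456)   # hit in the TLB
print(hex(pa1), hex(pa2), "tablewalks:", iommu.tablewalks)
```

The second access to the same page is served from the TLB, so only one tablewalk is counted.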
1.3.3 External Cache Control Unit (ECU)
The main role of the ECU is to handle I-cache and D-cache misses efficiently. The ECU can handle one access every other cycle to the external cache. Loads that miss in the D-cache cause 16-byte D-cache fills using two consecutive 8-byte accesses to the E-cache. Stores are write-through to the E-cache and are fully pipelined. Instruction prefetches that miss the I-cache cause 32-byte I-cache fills using four consecutive 8-byte accesses to the E-cache. The E-cache is parity-protected.
In addition, the ECU supports DMA accesses that hit in the external cache and maintains data coherency between the external cache and main memory. The size of the external cache can be 256 kB, 512 kB, 1 MB, or 2 MB (the line size is always 64 bytes). Cache lines have only three states: modified, exclusive, and invalid.
The combination of the load buffer and the ECU is fully pipelined. For programs with large data sets, instructions are scheduled with load latencies based on the E-cache latency, so the E-cache acts like a large primary cache. Floating-point applications use this feature to effectively “hide” D-cache misses. Coherency is maintained between all caches and external PCI DMA references.
The ECU overlaps processing during load and store misses. Stores that hit the E-cache can proceed while a load miss is being processed. The ECU is also capable of processing reads and writes without a costly turnaround penalty on the bidirectional E-cache data bus.
Block loads and block stores, which load or store a 64-byte line of data between memory or E-cache and the floating-point register file, provide high transfer bandwidth. By not installing into the E-cache on a miss, they avoid polluting the cache with data that is touched only once. The ECU also provides support for multiple outstanding data transfer requests to the MCU and PBM.
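The fill sizes and E-cache capacities quoted above reduce to simple arithmetic, sketched here in Python. The function names are hypothetical; the byte counts come from the text.

```python
ECACHE_ACCESS_BYTES = 8     # each E-cache access moves 8 bytes (from the text)
ECACHE_LINE_BYTES = 64      # E-cache line size (from the text)

def accesses_per_fill(fill_bytes):
    """Number of consecutive 8-byte E-cache accesses needed for one fill."""
    return fill_bytes // ECACHE_ACCESS_BYTES

def ecache_lines(size_bytes):
    """Number of 64-byte lines in an E-cache of the given size."""
    return size_bytes // ECACHE_LINE_BYTES

print("D-cache fill (16 bytes):", accesses_per_fill(16), "accesses")
print("I-cache fill (32 bytes):", accesses_per_fill(32), "accesses")
for kib in (256, 512, 1024, 2048):
    print(f"{kib} kB E-cache: {ecache_lines(kib * 1024)} lines")
```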
1.3.3.1 E-Cache SRAM Modes
The UltraSPARC-IIi supports two alternative E-cache SRAM configurations that have particular operational modes:
• 2–2–2 (Pipelined) mode
• 2–2 (Register-Latched) mode
In 2–2–2 (Pipelined) mode the E-cache SRAMs have a cycle time equal to half the processor cycle time. The name “2–2–2” indicates that it takes two processor clocks to send the address, two to access the SRAM array, and two to return the E-cache data. 2–2–2 mode has a 6-cycle pin-to-pin latency and provides the least expensive SRAM solution at a given frequency.
In 2–2 (Register-Latched) mode the E-cache SRAMs also have a cycle time equal to half of the processor cycle time. The name “2–2” indicates that it takes two processor clocks to send the address and two clocks to access and return the E-cache data. 2–2 mode has a 4-cycle pin-to-pin latency, which provides lower E-cache latency. In addition, no dead cycles are necessary when alternating between reads and writes because of tighter control over turn-on and turn-off times in these SRAMs.
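The mode names encode the per-stage clock counts, so the pin-to-pin latency is just their sum. The Python sketch below shows this and converts the result to nanoseconds for an assumed 300 MHz processor clock; the clock value and the function names are assumptions for illustration.

```python
SRAM_MODES = {
    "2-2-2 (Pipelined)": (2, 2, 2),        # send address, access SRAM array, return data
    "2-2 (Register-Latched)": (2, 2),      # send address, access and return data
}

def pin_to_pin_latency(stages):
    """Pin-to-pin latency in processor clocks: the sum of the per-stage counts."""
    return sum(stages)

def latency_ns(stages, clock_hz):
    """The same latency in nanoseconds for a given processor clock."""
    return pin_to_pin_latency(stages) * 1e9 / clock_hz

ASSUMED_CLOCK_HZ = 300e6                   # assumption for the example only
for name, stages in SRAM_MODES.items():
    cycles = pin_to_pin_latency(stages)
    print(f"{name}: {cycles} cycles, {latency_ns(stages, ASSUMED_CLOCK_HZ):.1f} ns")
```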
1.3.4 Memory Controller Unit (MCU)
All transactions to the DRAM and UPA64S subsystems are handled by the MCU. The external pins controlled by the MCU operate at divisions of the processor clock: the UPA64S bus runs at 1/3 the rate of the processor clock, and the data transfer rate through the DRAM transceivers is programmable but is typically 1/4 of the processor clock rate (the other options are 1/3 or 1/5).
External data transceivers allow the DRAM data to be twice as wide as the processor’s MEMDATA pins, so the EDO CAS cycle is only 26.5 ns at 300 MHz. These transceivers are commodity parts available from Texas Instruments. The MCU supports a composite DRAM specification that is a superset of the 60 ns EDO DRAM specifications from all major vendors. Use of faster DRAMs allows performance higher than quoted, as the various components of memory delay are programmable. A typical memory configuration is shown in FIGURE 1-3.
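The clock divisions above can be turned into rough numbers. The Python sketch below rests on an assumed reading of the text: one 64-bit transfer on the MEMDATA pins takes one divided clock period, and each 128-bit DRAM access supplies two such transfers. Under that assumption the CAS cycle at 300 MHz comes out close to, but not exactly at, the 26.5 ns quoted above, so treat the result as an approximation. The function names are hypothetical.

```python
def period_ns(clock_hz):
    """Clock period in nanoseconds."""
    return 1e9 / clock_hz

def upa64s_clock_hz(cpu_hz):
    """UPA64S bus clock: 1/3 of the processor clock (from the text)."""
    return cpu_hz / 3

def dram_cas_cycle_ns(cpu_hz, divisor=4, transfers_per_access=2):
    # Assumption: each MEMDATA transfer takes `divisor` processor clocks, and a
    # DRAM access twice as wide as MEMDATA feeds two transfers per CAS cycle.
    return transfers_per_access * divisor * period_ns(cpu_hz)

CPU_HZ = 300e6
print(f"UPA64S clock: {upa64s_clock_hz(CPU_HZ) / 1e6:.0f} MHz")
for divisor in (3, 4, 5):
    print(f"divisor 1/{divisor}: CAS cycle about {dram_cas_cycle_ns(CPU_HZ, divisor):.1f} ns")
```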
[Diagram. Labels shown: UltraSPARC-IIi; 512 KB L2 cache with tag (TA(15:0), TD(14+2+P)), data (DA(18+8BE), D(64+8P)), and clock; memory address and control; data (64+8 ECC) to the transceivers; memory data (128+16 ECC) from the transceivers to memory (8 DIMMs).]
FIGURE 1-3 UltraSPARC-IIi Memory: Typical Configuration
1.3.5 Instruction Cache (I-cache)
The I-cache is a 16 Kilobyte two-way set-associative cache with 32-byte blocks. The cache is physically indexed and physically tagged. The set is predicted as part of the “next field” so that only the index bits of an address are necessary to address the cache. (This means only 13 bits, which matches the minimum page size.) The instruction cache returns up to 4 instructions from a line that is 8 instructions wide.
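The 13-bit figure follows directly from the cache geometry, as the Python sketch below shows. The function name is hypothetical; the sizes are those given in the text.

```python
def cache_geometry(size_bytes, ways, block_bytes):
    """Return (number of sets, block-offset bits, set-index bits)."""
    sets = size_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1   # log2, valid for powers of two
    set_bits = sets.bit_length() - 1
    return sets, offset_bits, set_bits

sets, offset_bits, set_bits = cache_geometry(16 * 1024, ways=2, block_bytes=32)
address_bits = offset_bits + set_bits            # bits needed to select a set and a byte
min_page_bits = (8 * 1024).bit_length() - 1      # 8 kB minimum page size

print(f"{sets} sets, {offset_bits} offset bits, {set_bits} set-index bits")
print(f"{address_bits} address bits used, page offset is {min_page_bits} bits")
```

Since the bits that address the cache lie entirely within the page offset, they are the same in the virtual and physical address.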
1.3.6 Data Cache (D-cache)
The data cache is a write-through, non-allocating, 16 Kilobyte direct-mapped cache with two 16-byte sub-blocks per line. It is virtually indexed and physically tagged. The tag array is dual-ported so that tag updates due to line fills do not collide with tag reads for incoming loads. Snoops to the D-cache use the second tag port so that an incoming load can proceed without being held up by a snoop.
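As a sketch of how such a cache is indexed, the Python below splits a virtual address into line, sub-block, and byte fields using the sizes from the text (16 KB, direct-mapped, 32-byte lines made of two 16-byte sub-blocks). The function name and example addresses are made up for illustration. Note that 16 KB of direct-mapped storage needs 14 address bits, one more than the 13-bit offset of an 8 kB page.

```python
CACHE_BYTES = 16 * 1024
SUBBLOCK_BYTES = 16
LINE_BYTES = 2 * SUBBLOCK_BYTES            # two sub-blocks per line
NUM_LINES = CACHE_BYTES // LINE_BYTES      # direct-mapped: one line per index

def dcache_index(vaddr):
    """Return (line index, sub-block within the line, byte within the sub-block)."""
    byte_in_subblock = vaddr % SUBBLOCK_BYTES
    subblock = (vaddr // SUBBLOCK_BYTES) % (LINE_BYTES // SUBBLOCK_BYTES)
    line = (vaddr // LINE_BYTES) % NUM_LINES
    return line, subblock, byte_in_subblock

print(NUM_LINES, "lines")
print(dcache_index(0x2010))                # example address
print(dcache_index(0x2010 + CACHE_BYTES))  # 16 KB later: maps to the same line
```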
1.3.7 Prefetch and Dispatch Unit (PDU)
The PDU fetches instructions before they are needed in the pipeline, so that the execution units do not starve for instructions. Instructions can be prefetched from all levels of the memory hierarchy, including the instruction cache, the external cache, and main memory. To prefetch across conditional branches, a dynamic branch prediction scheme is implemented in hardware, based on a two-bit history of the branch. A “next field” associated with every four instructions in the I-cache points to the next I-cache line to be fetched. This makes it possible to follow taken branches and provides the same instruction bandwidth achieved during sequential code. Up to 12 prefetched instructions are stored in the instruction buffer and sent to the rest of the pipeline.
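The text says only that prediction is based on a two-bit history. One common textbook way to use two bits is a saturating counter, sketched below in Python. Whether the UltraSPARC-IIi hardware uses exactly this state machine is an assumption; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop back-edge: taken nine times, then not taken once; the loop runs twice.
loop_branch = [True] * 9 + [False]
misses = count_mispredictions(loop_branch * 2, TwoBitPredictor())
print("mispredictions:", misses)
```

With two bits of state, a single loop exit does not flip the prediction, so the second pass through the loop starts out predicted correctly.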
1.3.8 Translation Lookaside Buffers (iTLB and dTLB)
The Translation Lookaside Buffers provide mapping between 44-bit virtual addresses and 34-bit physical addresses. A 64-entry iTLB is used for instructions and a 64-entry dTLB for data, and both are fully associative. UltraSPARC-IIi provides hardware support for a software-based TLB miss strategy. A trap to special software handlers installs new entries, typically with a latency on the order of an E-cache miss. A separate set of global registers is available whenever such a trap is encountered, for low-latency miss handling. Page sizes of 8 kB, 64 kB, 512 kB, and 4 MB are supported.
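Each page size fixes how many low-order address bits pass through untranslated. The Python sketch below splits an address into virtual page number and page offset for the four supported sizes. It is a simplification: it masks the address to 44 bits and ignores any sign-extension rules, and the function name and example address are made up for illustration.

```python
VA_BITS = 44
PAGE_SIZES = {
    "8 kB": 8 * 1024,
    "64 kB": 64 * 1024,
    "512 kB": 512 * 1024,
    "4 MB": 4 * 1024 * 1024,
}

def split_virtual_address(vaddr, page_bytes):
    """Return (virtual page number, page offset, number of offset bits)."""
    offset_bits = page_bytes.bit_length() - 1        # log2, valid for powers of two
    vaddr &= (1 << VA_BITS) - 1                      # keep the low 44 bits only
    return vaddr >> offset_bits, vaddr & (page_bytes - 1), offset_bits

vaddr = 0x12345678                                   # arbitrary example address
for name, size in PAGE_SIZES.items():
    vpn, offset, bits_used = split_virtual_address(vaddr, size)
    print(f"{name}: {bits_used} offset bits, VPN {vpn:#x}, offset {offset:#x}")
```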
1.3.9
Integer Execution Unit (IEU)
The IEU contains the following components:
• Two ALUs
• A multi-cycle integer multiplier
• A multi-cycle integer divider
• Eight register windows
• Four sets of global registers (normal, alternate, MMU, and interrupt globals)
• The trap registers (see TABLE 1-1 for supported trap levels)
TABLE 1-1 shows that UltraSPARC-IIi supports one more than the four trap levels mandated by the SPARC Version 9 specification.
TABLE 1-1 Supported Trap Levels
                  MAXTL   Trap Levels
UltraSPARC-IIi    4       5
1.3.10
Floating-Point Unit (FPU)
The separation of the execution units in the FPU allows UltraSPARC-IIi to issue and execute two floating-point instructions per cycle. Source data and results data are stored in the 32-entry register file, where each entry can contain a 32- or 64-bit value. Most instructions are fully pipelined (throughput of one per cycle), have a latency of three cycles, and are not affected by the precision of the operands (same latency for single or double precision). The divide and square-root instructions are not pipelined. These take 12 cycles (single precision) and 22 cycles (double precision) to execute, but they do not stall the processor. Other instructions following the divide/square root can be issued, executed, and retired to the register file before the divide/square root finishes. A precise exception model is maintained by synchronizing the floating-point pipe with the integer pipe and by predicting traps for long-latency operations.
1.3.11
Graphics Unit (GRU)
UltraSPARC-IIi introduces a comprehensive set of graphics instructions (VIS) that provide industry-leading support for two-dimensional and three-dimensional image and video processing, image compression, audio processing, and similar functions. Sixteen-bit and 32-bit partitioned add, boolean and compare are provided. Eight-bit and 16-bit partitioned multiplies are supported. Single cycle pixel distance, data alignment, packing and merge operations are all supported in the GRU.
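As an illustration of what a 16-bit partitioned add computes, the following portable C sketch emulates the operation in software. It assumes four independent 16-bit lanes packed into a 64-bit value, with each lane wrapping modulo 2^16 and no carry between lanes; the function name `partitioned_add16` and the lane numbering are illustrative and are not part of the VIS instruction set.

```c
#include <stdint.h>

/* Software emulation of a partitioned 16-bit add (hypothetical helper).
 * Each of the four 16-bit lanes is added independently; a carry out of
 * one lane never propagates into its neighbor. */
static uint64_t partitioned_add16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t sum = (uint16_t)(x + y);   /* wraps within the lane */
        result |= (uint64_t)sum << (16 * lane);
    }
    return result;
}
```

The hardware performs all four lane additions in one instruction; the loop here only makes the lane boundaries explicit.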
1.3.12
Load/Store Unit (LSU)
The LSU is responsible for generating the virtual address of all loads and stores (including atomics and ASI loads), for accessing the data cache, for decoupling load misses from the pipeline through the load buffer, and for decoupling the stores through a store buffer. One load or one store can be issued per cycle. The store buffer can compress (or gather) multiple stores to the same 8 bytes into a single E-cache access. The UPA64S and PCI control units can compress sequential 8-byte stores into burst transactions, to improve noncacheable store bandwidth.
1.3.13
Phase Locked Loops (PLL)
To minimize the clock skew at the system level, UltraSPARC-IIi has PLLs for both the processor clock and the PCI clock. The internal PCI clock runs at twice the speed of the PCI interface clock. For details, see Appendix F, “Pin and Signal Descriptions.”
1.3.14
Signals
All external cache signals are 2.6 V and exist only on the processor module. All other signals are 3.3 V LVTTL. The highest frequency signal that comes from the module to the motherboard is 75 MHz (unless the 100 MHz UPA64S interface is used). This allows cost savings in motherboard design.
FIGURE 1-3 on page 8 shows an UltraSPARC-IIi subsystem, which consists of the UltraSPARC-IIi processor and synchronous SRAM components for the E-cache tags and data.
CHAPTER
2
Processor Pipeline
2.1
Introduction
UltraSPARC-IIi contains a nine-stage pipeline. Most instructions go through the pipeline in exactly 9 stages. The instructions are considered terminated after they go through the last stage (W), after which changes to the processor architectural state are irreversible. FIGURE 2-1 shows a simplified diagram of the integer and floating-point pipeline stages.
Integer Pipeline: Fetch, Decode, Group, Execute, Cache, N1, N2, N3, Write
Floating-Point & Graphics Pipeline: Register, X1, X2, X3 (in parallel with Execute, Cache, N1, and N2)
FIGURE 2-1 UltraSPARC-IIi Pipeline Stages (Simplified)
Three additional stages are added to the integer pipeline to make it symmetrical with the floating-point pipeline. This simplifies pipeline synchronization and exception handling. It also eliminates the need to implement a floating-point queue. Floating-point instructions with a latency greater than three (divide, square root, and inverse square root) behave differently than other instructions; the pipe is “extended” when the instruction reaches stage N1. See Chapter 21, “Code Generation Guidelines” for more information. Memory operations are allowed to proceed asynchronously with the pipeline in order to support latencies longer than the latency of the on-chip D-cache.
2.2
Pipeline Stages
This section describes each pipeline stage in detail. FIGURE 2-2 illustrates the pipeline stages.
[Diagram: pipeline stages F/D, G, E, C, N1, N2, N3, and W (R, X1, X2, and X3 for the FPU and GRU), shown across the PDU (instruction buffers), the IEU (IU register file and annex), the LSU (D-cache tag and data, TLB, tag check, LDQ/STQ, SB), the ECU, and the FPU/GRU (32 x 64 FP register file, FP add, FP mul, G ALU, G mul), connected by address, data, and instruction buses.]
FIGURE 2-2 UltraSPARC-IIi Pipeline Stages (Detail)
2.2.1
Stage 1: Fetch (F) Stage
Prior to their execution, instructions are fetched from the Instruction Cache (I-cache) and placed in the Instruction Buffer, where eventually they will be selected to be executed. Accessing the I-cache is done during the F Stage. Up to four instructions are fetched along with branch prediction information, the predicted target address of a branch, and the predicted set of the target. The high bandwidth provided by the I-cache (4 instructions/cycle) allows UltraSPARC-IIi to prefetch instructions ahead of time based on the current instruction flow and on branch prediction. Providing a fetch bandwidth greater than or equal to the maximum execution bandwidth assures that, for well behaved code, the processor does not starve for instructions. Exceptions to this rule occur when branches are hard to predict, when branches are very close to each other, or when the I-cache miss rate is high.
2.2.2
Stage 2: Decode (D) Stage
After being fetched, instructions are pre-decoded and then sent to the Instruction Buffer. The pre-decoded bits generated during this stage accompany the instructions during their stay in the Instruction Buffer. Upon reaching the next stage (where the grouping logic lives) these bits speed up the parallel decoding of up to 4 instructions. While it is being filled, the Instruction Buffer also presents up to 4 instructions to the next stage. A pair of pointers manages the Instruction Buffer, ensuring that as many instructions as possible are presented in order to the next stage.
2.2.3
Stage 3: Grouping (G) Stage
The G-stage logic’s main task is to group and dispatch a maximum of four valid instructions in one cycle. It receives a maximum of four valid instructions from the Prefetch and Dispatch Unit (PDU), it controls the Integer Core Register File (ICRF), and it routes valid data to each integer functional unit. The G-stage sends up to two floating-point or graphics instructions out of the four candidates to the Floating-Point and Graphics Unit (FGU). The G-stage logic is responsible for comparing register addresses for integer data bypassing and for handling pipeline stalls due to interlocks.
2.2.4
Stage 4: Execution (E) Stage
Data from the integer register file is processed by the two integer ALUs during this cycle (if the instruction group includes ALU operations). Results are computed and are available for other instructions (through bypasses) in the very next cycle. The virtual address of a memory operation is also calculated during the E Stage, in parallel with ALU computation.
FLOATING-POINT AND GRAPHICS UNIT: The Register (R) Stage of the FGU. The floating-point register file is accessed during this cycle. The instructions are also further decoded and the FGU control unit selects the proper bypasses for the current instructions.
2.2.5
Stage 5: Cache Access (C) Stage
The virtual address of memory operations calculated in the E-stage is sent to the tag RAM to determine if the access (load or store type) is a hit or a miss in the D-cache. In parallel the virtual address is sent to the data MMU to be translated into a physical address. On a load when there are no other outstanding loads, the data array is accessed so that the data can be forwarded to dependent instructions in the pipeline as soon as possible.
ALU operations executed in the E-stage generate condition codes in the C Stage. The condition codes are sent to the PDU, which checks whether a conditional branch in the group was correctly predicted. If the branch was mispredicted, the younger instructions in earlier stages of the pipe are flushed and the correct instructions are fetched. The results of ALU operations are not modified after the E Stage; the data merely propagates down the pipeline (through the annex register file), where it is available for bypassing for subsequent operations.
FLOATING-POINT AND GRAPHICS UNIT: The X1 Stage of the FGU. Floating-point and graphics instructions start their execution during this stage. Instructions of latency one also finish their execution phase during the X1 Stage.
2.2.6
Stage 6: N1 Stage
A data cache (D-cache) miss/hit or a TLB miss/hit is determined during the N1 Stage. If a load misses the D-cache, it enters the Load Buffer. The access will arbitrate for the E-cache if there are no older unissued loads. If a TLB miss is detected, a trap will be taken and the address translation is obtained through a software routine. The physical address of a store is sent to the Store Buffer during this stage. To avoid pipeline stalls when store data is not immediately available, the store address and data parts are decoupled and sent to the Store Buffer separately.
FLOATING-POINT AND GRAPHICS UNIT: The X2 stage of the FGU. Execution continues for most operations.
2.2.7
Stage 7: N2 Stage
Most floating-point instructions finish their execution during this stage. After N2, data can be bypassed to other stages or forwarded to the data portion of the Store Buffer. All loads that have entered the Load Buffer in N1 continue their progress through the buffer; they will reappear in the pipeline only when the data comes back. Normal dependency checking is performed on all loads, including those in the load buffer.
FLOATING-POINT AND GRAPHICS UNIT: The X3 stage of the FGU.
2.2.8
Stage 8: N3 Stage
UltraSPARC-IIi resolves traps at this stage.
2.2.9
Stage 9: Write (W) Stage
All results are written to the register files (integer and floating-point) during this stage. All actions performed during this stage are irreversible. After this stage, instructions are considered terminated.
CHAPTER
3
Cache Organization
3.1
Introduction
3.1.1
Level-1 Caches
The UltraSPARC-IIi Level-1 D-cache is virtually indexed, physically tagged (VIPT). Virtual addresses are used to index into the D-cache tag and data arrays while accessing the D-MMU (that is, the dTLB). The resulting tag is compared against the translated physical address to determine D-cache hits. A side-effect inherent in a virtual-indexed cache is address aliasing; this issue is addressed in Section 8.2.1, “Address Aliasing Flushing” on page 68. The UltraSPARC-IIi Level-1 I-cache is physically indexed, physically tagged (PIPT). The lowest 13 bits of instruction addresses are used to index into the I-cache tag and data arrays while accessing the I-MMU (that is, the iTLB). The resulting tag is compared against the translated physical address to determine I-cache hits.
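The indexing described above can be sketched in C. The sizes are taken from this chapter (16 kB direct-mapped D-cache, 16 kB two-way I-cache, 32-byte blocks, 8 kB minimum page); treating the D-cache line as 32 bytes (two 16-byte sub-blocks) and the helper names `dcache_index`, `icache_index`, and `dcache_may_alias` are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES        32u             /* assumed: two 16-byte sub-blocks */
#define DCACHE_BYTES      (16u * 1024u)   /* direct-mapped */
#define ICACHE_SET_BYTES  (8u * 1024u)    /* 16 kB split across two sets */
#define MIN_PAGE_BYTES    (8u * 1024u)

/* D-cache line index: address bits [13:5] of the virtual address. */
static unsigned dcache_index(uint64_t va)
{
    return (unsigned)((va % DCACHE_BYTES) / LINE_BYTES);
}

/* I-cache line index within a set: address bits [12:5]. These bits lie
 * inside the 8 kB page offset, so they are identical in the VA and PA. */
static unsigned icache_index(uint64_t addr)
{
    return (unsigned)((addr % ICACHE_SET_BYTES) / LINE_BYTES);
}

/* Two virtual addresses with the same page offset could map to the same
 * physical byte; they alias in the D-cache if they select different lines. */
static bool dcache_may_alias(uint64_t va1, uint64_t va2)
{
    bool same_offset = (va1 % MIN_PAGE_BYTES) == (va2 % MIN_PAGE_BYTES);
    return same_offset && dcache_index(va1) != dcache_index(va2);
}
```

Because the D-cache index uses bit 13, which lies above the 8 kB page offset, two mappings of the same physical page whose virtual addresses differ in bit 13 land in different D-cache lines. The I-cache index stays within the page offset and has no such issue.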
3.1.1.1
Instruction Cache (I-cache)
The I-cache is a 16 kB pseudo-two-way set-associative cache with 32-byte blocks. The set is predicted based on the next fetch address; thus, only the index bits of an address are necessary to address the cache (that is, the lowest 13 bits, which matches the minimum page size of 8 kB). Instruction fetches bypass the instruction cache under the following conditions:
• When the I-cache enable or I-MMU enable bits in the LSU_Control_Register are clear (see Section A.6, “LSU_Control_Register” on page 384)
• When the processor is in RED_state, or
• When the I-MMU maps the fetch as noncacheable
The instruction cache snoops stores from DMA transfers, but it is not updated by stores, except for block commit stores (see Section 13.5.3, “Block Load and Store Instructions” on page 172). The FLUSH instruction can be used to maintain coherency. Block commit stores invalidate I-cache but do not flush instructions that have already been prefetched into the pipeline. A FLUSH, DONE, or RETRY instruction can be used to flush the pipeline. For block copies that must maintain I-cache coherency, it is more efficient to use block commit stores in the loop, followed by a single FLUSH instruction to flush the pipeline.
Note – The size of each I-cache set is the same as the page size in UltraSPARC-IIi; thus, the virtual index bits equal the physical index bits.
3.1.1.2
Data Cache (D-cache)
The D-cache is a write-through, nonallocating-on-write-miss, 16 kB direct-mapped cache with two 16-byte sub-blocks per line. Data accesses bypass the data cache when the D-cache enable bit in the LSU_Control_Register is clear (see Section A.6, “LSU_Control_Register” on page 384). Load misses will not allocate in the D-cache if the D-MMU enable bit in the LSU_Control_Register is clear or the access is mapped by the D-MMU as virtual noncacheable.
Note – A noncacheable access may access data in the D-cache from an earlier cacheable access to the same physical block, unless the D-cache is disabled. Software must flush the D-cache when changing a physical page from cacheable to noncacheable (see Section 8.2, “Cache Flushing”). In UltraSPARC-IIi, the noncacheable accesses must follow the physical address space definition, so that this issue should not occur.
3.1.2
Level-2 PIPT External Cache (E-cache)
The UltraSPARC-IIi E-cache (also known as level-2 cache) is physically indexed, physically tagged (PIPT). This cache has no virtual address or context information. The operating system needs no knowledge of such caches after initialization, except for stable storage management and error handling. Memory accesses must be cacheable in the E-cache. As a result, there is no E-cache enable bit in the LSU_Control_Register.
Instruction fetches are directed to noncacheable PCI or UPA64S space when:
• The I-MMU is disabled, or
• The processor is in RED_state, or
• The access is mapped by the I-MMU as physically noncacheable
Data accesses are directed to noncacheable PCI or UPA64S space when:
• The D-MMU enable bit (DM) in the LSU_Control_Register is clear, or
• The access is mapped by the D-MMU as physically noncacheable (unless ASI_PHYS_USE_EC is used)
Note – When noncacheable accesses are used, the associated addresses must be legal according to the physical address map in TABLE 6-1 on page 36. The system must provide a noncacheable, ECC-less scratch memory for use of the booting code until the MMUs are enabled.
The E-cache is a unified, write-back, allocating, direct-mapped cache. The E-cache always includes the contents of the I-cache and D-cache. The E-cache size can range from 256 kB to 2 MB with a line size of 64 bytes. See TABLE 1-1 on page 10. Block loads and block stores, which load or store a 64-byte line of data from memory to the floating-point register file, do not allocate into the E-cache, to avoid pollution.
CHAPTER
4
Overview of I and D-MMUs
4.1
Introduction
Instruction and Data MMUs are similar and are generically referred to as “MMU.” This chapter describes the UltraSPARC-IIi Memory Management Unit as it is seen by the operating system software. The UltraSPARC-IIi MMU conforms to the requirements set forth in The SPARC Architecture Manual, Version 9.
Note – The UltraSPARC-IIi MMU does not conform to the SPARC-V8 Reference MMU Specification. In particular, the UltraSPARC-IIi MMU supports a 44-bit virtual address space, software TLB miss processing only (no hardware page table walk), simplified protection encoding, and multiple page sizes. All of these differ from features required of SPARC-V8 Reference MMUs.
4.2
Virtual Address Translation
The UltraSPARC-IIi MMU supports four page sizes: 8 kB, 64 kB, 512 kB, and 4 MB. It supports a 44-bit virtual address space, with 41 bits of physical address. During each processor cycle the UltraSPARC-IIi MMU provides one instruction and one data virtual-to-physical address translation. In each translation, the virtual page number is replaced by a physical page number, which is concatenated with the page offset to form the full physical address, as illustrated in FIGURE 4-1 on page 24. (This figure shows the full 64-bit virtual address, even though UltraSPARC-IIi supports only 44 bits of VA.)
Page Size   Virtual Page Number   Physical Page Number   Page Offset
8 kB        VA[63:13]             PA[40:13]              [12:0]
64 kB       VA[63:16]             PA[40:16]              [15:0]
512 kB      VA[63:19]             PA[40:19]              [18:0]
4 MB        VA[63:22]             PA[40:22]              [21:0]
FIGURE 4-1 Virtual-to-physical Address Translation for all Page Sizes
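The concatenation of physical page number and page offset can be written out as a short C sketch. The `page_size` enumeration and the `translate` helper are hypothetical names introduced here; the sketch assumes the physical page number has already been obtained (for example, from a TLB entry) and does not model the TLB lookup itself.

```c
#include <stdint.h>

/* Page-offset widths, in bits, for the four supported page sizes. */
enum page_size {
    PAGE_8K   = 13,
    PAGE_64K  = 16,
    PAGE_512K = 19,
    PAGE_4M   = 22
};

/* Hypothetical helper: replace the virtual page number with the physical
 * page number and keep the page offset unchanged. */
static uint64_t translate(uint64_t va, uint64_t ppn, enum page_size offset_bits)
{
    uint64_t offset_mask = (1ULL << offset_bits) - 1;
    return (ppn << offset_bits) | (va & offset_mask);
}
```

For example, with physical page number 3, the virtual address 0x12345 translates to 0x6345 for an 8 kB page and to 0x32345 for a 64 kB page, because the larger page keeps more low-order bits as offset.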
UltraSPARC-IIi implements a 44-bit virtual address space in two equal halves at the extreme lower and upper portions of the full 64-bit virtual address space. Virtual addresses between 0x0000 0800 0000 0000 and 0xFFFF F7FF FFFF FFFF, inclusive, are termed “out of range” for UltraSPARC-IIi and are illegal. (In other words, virtual address bits VA[63:43] must be either all zeros or all ones.) FIGURE 4-2 on page 25 illustrates the UltraSPARC-IIi virtual address space.
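The out-of-range check can be expressed in two equivalent ways, sketched below in C. The bounds come directly from the text; the bit-level form assumes that "44-bit, sign-extended" means bits 63 through 43 are all equal, which follows from those bounds. Both function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when va falls inside the out-of-range hole given in the text. */
static bool va_out_of_range(uint64_t va)
{
    return va >= 0x0000080000000000ULL && va <= 0xFFFFF7FFFFFFFFFFULL;
}

/* Equivalent bit test: a legal address has bits 63:43 (21 bits)
 * either all zeros or all ones. */
static bool va_is_sign_extended_44(uint64_t va)
{
    uint64_t upper = va >> 43;
    return upper == 0 || upper == 0x1FFFFF;
}
```

For every 64-bit value, `va_out_of_range(va)` is the logical negation of `va_is_sign_extended_44(va)`.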
[Diagram: virtual address map from 0000 0000 0000 0000 to FFFF FFFF FFFF FFFF. The out-of-range VA region (VA “Hole”) extends from 0000 0800 0000 0000 to FFFF F7FF FFFF FFFF. Boundary labels also appear at 0000 07FE FFFF FFFF, 0000 07FF FFFF FFFF, FFFF F800 0000 0000, and FFFF F801 0000 0000, with regions marked “See Note (1)”.]
Note (1): Prior implementations restricted use of this region to data only.
FIGURE 4-2 UltraSPARC-IIi 44-bit Virtual Address Space, with Hole (Same as FIGURE 14-2 on page 184)
Note – Throughout this document, when virtual address fields are specified as 64-bit quantities, they are assumed to be sign-extended based on VA[43].
The operating system maintains translation information in a data structure called the Software Translation Table. The I- and D-MMU each contain a hardware Translation Lookaside Buffer (iTLB and dTLB). These buffers act as independent caches of the Software Translation Table, providing one-cycle translation for the more frequently accessed virtual pages.
FIGURE 4-3 on page 26 shows a general software view of the UltraSPARC-IIi MMU. The TLBs, which are part of the MMU hardware, are small and fast. The Software Translation Table, which is kept in memory, is likely to be large and complex. The Translation Storage Buffer (TSB), which acts like a direct-mapped cache, is the interface between the two. The TSB can be shared by all processes running on a processor, or it can be process specific. The hardware does not require any particular scheme.
The term “TLB hit” means that the desired translation is present in the MMU’s on-chip TLB. The term “TLB miss” means that the desired translation is not present in the MMU’s on-chip TLB. On a TLB miss the MMU immediately traps to software for TLB miss processing. The TLB miss handler has the option of filling the TLB by any means available, but it is likely to take advantage of the TLB miss support features provided by the MMU, since the TLB miss handler is time-critical code. Hardware support is described in Section 15.3.1, “Hardware Support for TSB Access” on page 209.
[Diagram: Translation Look-aside Buffers (MMU), Translation Storage Buffer (Memory), and Software Translation Table (O/S Data Structure).]
FIGURE 4-3 Software View of the UltraSPARC-IIi MMU
Aliasing between pages of different size (when multiple VAs map to the same PA) may take place, as with the SPARC-V8 Reference MMU. The reverse case, when multiple mappings from one VA/context to multiple PAs produce a multiple TLB match, is not detected in hardware; it produces undefined results.
Note – The hardware ensures the physical reliability of the TLB on multiple matches.
CHAPTER
5
UltraSPARC-IIi in a System
5.1
A Hardware Reference Platform
The elements of the hardware, the associated peripherals and their function can be presented by considering each one in the context of a hardware reference platform. FIGURE 5-1 shows a typical rendering of such a platform. This model assumes CPU and SRAM for the E-cache are provided on the same module, to keep the high-speed E-cache interface in a controlled electrical environment and away from the motherboard. A typical module uses five, 64 K x 18 register-latch SRAMs, to provide a 512-kilobyte E-cache. The reference platform provides support for two standard, 33 MHz, 32-bit, PCI busses, along with a 66 MHz, 32-bit PCI interface to a bus bridge ASIC, for example, the Advanced PCI Bridge (APB™). Graphics can be implemented using a PCI add-in card, or by means of a custom UPA64S solution.
[Diagram: UltraSPARC-IIi and L2 cache module connected to UPA64S (address and control), the RIC, the PCI bus and Advanced PCI Bridge (APB) with two 32-bit PCI buses, and the memory subsystem (MEMADDR and control, MEMDATA (64 + ECC), transceivers, memory data (128 + 16 ECC), eight memory DIMMs). Peripherals shown include 10/100 Mb Ethernet with a 10/100 transceiver (MII and TP ports), Super I/O (floppy, keyboard controller, serial A/B), an EIDE connector, 8 KB TOD/NVRAM, and a 1 MB boot PROM.]
FIGURE 5-1 Overview of UltraSPARC-IIi Reference Platform
5.2
Memory Subsystem
FIGURE 5-2 shows how memory is connected to, and controlled by, the UltraSPARC-IIi. The memory DIMMs are arranged on a 144-bit bus to allow an entire cache line to be fetched in four CAS accesses. UltraSPARC-IIi implements ECC, with single-bit correction and multi-bit detection of errors, for all memory data transfers.
[Diagram: UltraSPARC-IIi with clock input; tag SRAM connected by TA(15:0) and TD(14+2+P); 512 KB L2 cache connected by DA(18+8BE) and D(64+8P); memory address and control, and data (64+8ECC) through transceivers to memory data (128+16ECC) and memory (8 DIMMs).]
FIGURE 5-2 A Typical Subsystem: UltraSPARC-IIi and Memory (Simplified Block Diagram)
5.2.1
E-cache
Synchronous access to the E-cache (L2-cache) is made through a data bus that carries 8 bytes plus parity. The UltraSPARC-I or UltraSPARC-II 1-1-1 style SRAMs can be used at half the processor clock rate. The UltraSPARC-II 2-2 mode SRAMs are also supported. There are enough cache address bits to support a 2 MB E-cache, with a practical minimum of 256 kB. E-cache can be fitted in these alternative configurations:
• Two 32K x 36 (data) plus one 4K x 18 minimum (tag; can use 32K x 36) = 256 kB
• Four 64K x 18 (data) plus one 8K x 18 minimum (tag; can use 32K x 36) = 512 kB
• Four 128K x 18 (data) plus one 16K x 18 minimum (tag; can use 32K x 36) = 1 MB
• Two 128K x 36 (data) plus one 16K x 18 minimum (tag; can use 32K x 36) = 1 MB
• Four 256K x 18 (data) plus one 32K x 18 minimum (tag; can use 32K x 36) = 2 MB
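The arithmetic behind these configurations can be checked with a short C sketch. It assumes one parity bit per eight data bits (so an x18 part holds 16 data bits per word and an x36 part holds 32), which matches the 8-bytes-plus-parity data bus, and one tag entry per 64-byte E-cache line. The helper names are hypothetical.

```c
#include <stdint.h>

#define ECACHE_LINE_BYTES 64u

/* Data capacity in bytes of `chips` SRAMs, each `depth` words deep and
 * `width` bits wide, assuming 1 parity bit per 8 data bits. */
static uint32_t ecache_data_bytes(uint32_t chips, uint32_t depth, uint32_t width)
{
    uint32_t data_bits = width / 9u * 8u;   /* 18 -> 16, 36 -> 32 */
    return chips * depth * (data_bits / 8u);
}

/* Minimum tag SRAM depth: one tag entry per 64-byte cache line. */
static uint32_t ecache_tag_entries(uint32_t data_bytes)
{
    return data_bytes / ECACHE_LINE_BYTES;
}
```

Two 32K x 36 parts give 262,144 bytes (256 kB) and need 4,096 tag entries, which matches the 4K-deep minimum tag SRAM in the first configuration; four 256K x 18 parts give 2 MB and need 32K tag entries.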
As provided in UltraSPARC-II, UltraSPARC-IIi supports software programming to selectively zero E-cache tag address bits, so that the same module can accommodate different sizes of SRAM IC, without the necessity of tying unused address lines low—which must be done if an over-capacity SRAM is used.
5.2.2
DRAM Memory
The following are the major features of the DRAM modules utilized in UltraSPARC-IIi memory:
• Four DIMM pairs for up to 256 Megabytes, using 168-pin JEDEC DIMMs, with 16-Megabit DRAM. Up to one Gigabyte, using 64-Megabit DRAM
• 144-bit DRAM data bus with 8-bit ECC on each 64 bits of data (industry standard ECC pinout)
• High performance CMOS silicon gate process
• Single +3.3 V ± 0.3 V power supply
• All device pins are 3.3 V compatible
• Low power, 9 mW standby; 1,800 mW active, typical
• Refresh modes: CAS-BEFORE-RAS (CBR)
• All inputs are buffered except RAS
• 2,048-cycle refresh distributed across 32 ms
• Extended Data Out (EDO) access cycles
The UltraSPARC-IIi memory design is built with JEDEC standard 168-pin DIMMs. The memory bus is 144 bits wide. RAS and CAS signals are provided that support a maximum of eight DIMMs of 8 to 128 megabytes each. A mode that supports 11-bit column addresses for 16M x 4, 64-megabit DRAMs allows a maximum of four DIMMs of 8 to 256 megabytes each. The memory bus width requires that the DIMMs be populated in pairs. Consequently the minimum memory configuration contains 16 megabytes and the maximum memory configuration contains 1 gigabyte. These DIMMs are available from many vendors. A composite specification was made considering typical vendor specifications. When the UltraSPARC-IIi is programmed according to Chapter 18, “MCU Control and Status Registers,” for a particular frequency and DIMM loading combination, it generates signals that meet this composite specification, if the electrical and topological motherboard layout requirements are met.
5.2.3
Transceivers
The Texas Instruments SN74ALVC16268 is a bidirectional registered 12-bit-to-24-bit bus exchanger with 3-state outputs. The transceiver transfers data bidirectionally between the 72-bit UltraSPARC-IIi memory data bus and the 144-bit DIMM memory data bus. The DIMMs cycle data in EDO mode at 37.5 MHz maximum frequency, a period of about 26.7 ns. The transceiver has bus-hold on data inputs, eliminating the need for external pullup resistors. It is available in 56-pin Plastic Shrink Small-Outline (DL) and Thin Shrink Small-Outline (DGG) packages. The ports connected to the DIMMs include the equivalent of 26 Ω series resistors, to make external series termination resistors unnecessary. The device provides synchronous data exchange between the two ports. Data is stored in the internal registers on the low-to-high transition of the CLK input, provided that the appropriate CLKEN inputs are low. All control inputs, including the CLK inputs, are driven by UltraSPARC-IIi.
5.3
PCI Interface—Advanced PCI Bridge
The PCI interface of UltraSPARC-IIi can be used directly or expanded using one or more PCI bridges. FIGURE 5-3 shows an example of the connection of an external PCI subsystem using Sun Microsystems, Inc. Advanced PCI Bridge (APB™). This configuration uses PCI clocks asynchronous with the processor clock and three or more PCI buses, all compatible with the existing PCI 2.1 standard:
• One 66 MHz, 32-bit primary bus from UltraSPARC-IIi to APB; note that multiple APBs can be used for multiplying PCI connectivity
• Two 33 MHz, 32-bit secondary busses from each APB
[Diagram: UltraSPARC-IIi core system module (processor and SRAMs) with a 72-bit path at 75 MHz and address/control lines to the transceivers (XCVR), which drive a 144-bit path at 37.5 MHz to the DRAM DIMMs; a UPA64S port, 64 bit at 100 MHz; and a 33/66 MHz 3.3 V PCI bus to the optional Advanced PCI Bridge (APB), whose PCI buses (33 MHz/5.0 V) serve I/O devices such as ATM, 10/100 MB, SCSI, T1/E1, and Super I/O. Annotation: “4 APB chips may be used to support up to 32 PCI”.]
FIGURE 5-3 UltraSPARC-IIi System Implementation Example
The interface from UltraSPARC-IIi with its I/O subsystems is a 32-bit PCI bus, which can run at either 33 or 66 MHz. UltraSPARC-IIi internal PLLs allow slower PCI bus clock rates, down to 20 MHz or 40 MHz for each range respectively. This allows use of more PCI targets than the 2.1 specification permits for full-speed operation. However, the PCI arbiters on UltraSPARC-IIi and APB only support four master requests per bus. The Advanced PCI Bridge (APB) allows external arbiters on the secondary buses. The UltraSPARC-IIi PCI interface runs at 3.3 V only. To support 5 V PCI cards, the Advanced PCI Bridge (APB) must be used, which also provides expansion from one 66 MHz 32-bit PCI bus, to two 32-bit 33 MHz PCI buses. APB provides up to 64-byte write posting and data prefetching, so that the delivered throughput can be higher than a single 33 MHz bus could provide. The secondary PCI buses have:
• 3.3 Volt operation and signalling, but are compatible with the PCI 5 V signalling environment definition
• 32-bit data bus
• Compatibility with the PCI Rev. 2.1 Specification
• Support for up to four master devices
Interrupts are not routed through the APB. A separate Drain/Empty protocol is used to guarantee that all DMA writes temporally complete to memory, prior to receipt of an interrupt, and thus before a potential processor trap as a result of that interrupt. The Primary bus, which can be used with or without the Advanced PCI Bridge, has the same characteristics discussed above, except it can run in the 20-33 MHz or the 40-66 MHz range. UltraSPARC-IIi operates internally at twice the external PCI clock frequency, that is, up to 132 MHz. This helps reduce the latency involved in crossing clock domains and manipulating state machines.
5.4
RIC Chip
The RIC Chip (SME2210) supports the system resets, system interrupts, system scan, and system clock control functions. Its features include:
• Support for resets from power supply, reset buttons, and scan
• Concentration of all of the interrupts; it sends interrupt numbers to the UltraSPARC-IIi
• Direction of SCAN inputs and outputs through scan chains
5.5
UPA64S interface (FFB)
UPA64S is a slave-only interface protocol used, for instance, by proprietary graphics boards. It can be used for any high bandwidth control or data transfers between the processor and a dedicated subsystem. Transfers to and from the UPA64S interface are fully synchronous, since UPA64S receives a PECL clock that is aligned with the processor’s clock. The processor transfers data on clock edges that correspond to the UPA64S clock edges. This interface runs at 1/3 of the processor clock rate, that is, up to 100 MHz. UltraSPARC-IIi drives the SYSADR (system address), ADR_VLD (address valid) signals, the S_REPLY handshake, and reset (RST_L) to the UPA64S. The data bus (64 bits out of 72) is shared with the transceiver connection to the UltraSPARC-IIi. The internal memory controller of the UltraSPARC-IIi transfers data aligned to processor clocks, but guarantees that UPA64S transfers appear aligned to the UPA64S clock. In other words, these are valid for three processor clock cycles, and only sampled on the UPA clock edge when UPA64S is driving.
Note that, although the transceivers only cycle the 72-bit MEMDATA at 75 MHz maximum, the FFB/UPA64S cycle this bus at up to 100 MHz.
5.6
Alternate RMTV support
UltraSPARC-IIi has a pin to select a second RMTV to allow use of PC compatible SuperIO chips on a PCI bus—see Section 17.3.2, “RED_state Trap Vector” on page 271.
5.7
Power Management
See Section 13.6.2, “SHUTDOWN” on page 179.
CHAPTER
6
Address Spaces, ASIs, ASRs, and Traps
6.1
Overview
A SPARC-V9 processor provides an Address Space Identifier (ASI) with every address sent to memory. The ASI is used to distinguish between different address spaces, provide an attribute that is unique to an address space, and to map internal control and diagnostics registers within a processor. SPARC-V9 also extends the limit of virtual addresses from 32 to 64 bits for each address space. SPARC-V9 continues to support 32-bit addressing by masking the upper 32 bits of the 64-bit address to zero when the address mask (AM) bit in the PSTATE register is set. Both big- and little-endian byte orderings are supported in UltraSPARC-IIi. The default data access byte ordering after a Power-On Reset (POR) is big-endian. Instruction fetches are always big-endian.
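The effect of the address mask bit on a generated address can be modeled in a few lines of C. This is a software model of the masking rule stated above, not processor code; the function name `effective_address` and the boolean parameter standing in for the AM bit are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model: when the AM bit is set, the upper 32 bits of the
 * 64-bit address are forced to zero; otherwise the address is unchanged. */
static uint64_t effective_address(uint64_t va, bool pstate_am)
{
    return pstate_am ? (va & 0xFFFFFFFFULL) : va;
}
```

With the mask set, an address such as 0xFFFFFFFF80001000 becomes 0x0000000080001000, which is how 32-bit code continues to see a 32-bit address space.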
6.2 Physical Address Space
The UltraSPARC-IIi memory management hardware uses a 44-bit virtual address and an 8-bit ASI to generate a 41-bit physical address. This physical address space can be accessed using either virtual-to-physical address mapping or the MMU bypass mode. For details of this mode, see Section 15.10, “MMU Bypass Mode.”
6.2.1 Port Allocations
UltraSPARC-IIi divides its physical address space among:
• DRAM
• UPA64S (for a graphics device – FFB)
• PCI, which is further subdivided into PCI A and B bus spaces when the Advanced PCI Bridge (APB) is used.
TABLE 6-1 UltraSPARC-IIi Address Map
Address Range in PA | Size | Port Addressed | Access Type
0x000.0000.0000 - 0x000.3FFF.FFFF | 1 GB | Main Memory | Cacheable
0x000.4000.0000 - 0x1FF.FFFF.FFFF | Do not use | Undefined | Cacheable
0x000.0000.0000 - 0x1FB.FFFF.FFFF | Do not use | Undefined | Noncacheable
0x1FC.0000.0000 - 0x1FD.FFFF.FFFF | 8 GB | UPA64S (FFB) | Noncacheable
0x1FE.0000.0000 - 0x1FF.FFFF.FFFF | 8 GB | PCI | Noncacheable
Only the Cacheability attribute and PA[33:32] are used for steering transactions. Note that, for compatibility with prior UltraSPARC systems, software should use PA[40:34] equal to all ‘1’s for noncacheable space, and all ‘0’s for cacheable space. UltraSPARC-IIi does not detect any errors associated with using a PA[40:34] that violates this convention. UltraSPARC-IIi also does not detect the error of using PA[33:32] in violation of the above cacheable/noncacheable partitioning. Consequently, all possible PAs decode to some destination. DRAM accesses wrap at the 1 GB boundary, although 4 GB of cacheable space is supported by the L2 cache tags, so the L2 cache will wrap at 4 GB. Noncacheable destinations are determined only by PA[33:32].
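The steering rule above can be sketched in C as follows. This is a software model under stated assumptions, not a description of the hardware logic: the names port_t, pa_to_port, and dram_offset are invented for illustration, and the mapping of PA[33:32] values to ports (00 and 01 to UPA64S, 10 and 11 to PCI) is inferred from the address ranges in TABLE 6-1.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { PORT_DRAM, PORT_UPA64S, PORT_PCI } port_t;

/* Model of transaction steering: only the cacheability attribute and
 * PA[33:32] are examined; PA[40:34] is ignored, as the text states. */
static port_t pa_to_port(uint64_t pa, bool cacheable)
{
    if (cacheable)
        return PORT_DRAM;            /* all cacheable PAs go to DRAM */

    switch ((pa >> 32) & 0x3) {      /* PA[33:32] */
    case 0:
    case 1:
        return PORT_UPA64S;          /* 0x1FC and 0x1FD ranges (assumed) */
    default:
        return PORT_PCI;             /* 0x1FE and 0x1FF ranges (assumed) */
    }
}

/* DRAM accesses wrap at the 1 GB boundary. */
static uint64_t dram_offset(uint64_t pa)
{
    return pa & ((UINT64_C(1) << 30) - 1);
}
```

For example, a noncacheable access to 0x1FE.0100.0000 is steered to PCI, and a cacheable access to 0x000.4000.0000 lands on DRAM offset 0 because of the 1 GB wrap.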
6.2.2 Memory DIMM requirements
There can be eight DIMMs ranging in size from 8 MB to 128 MB. An alternate mode for supporting DRAM with 11-bit column addressing allows four DIMMs ranging in size from 8 MB to 256 MB. Each DIMM can have two banks of DRAM, controlled by separate RAS# signals.
The Memory Controller timing is programmable. The assumption is that ADDR, CAS#, and WE# are buffered on the DIMM, and that RAS#, CAS#, and WE# are buffered on the motherboard.
Note the prior address/cacheability map implies that it is impossible to cause noncacheable access to main memory. Parameters that affect the address assignments of each DIMM module are DIMM size and the pair in which the DIMM is installed. DIMMs must be loaded in pairs. If the same size memory DIMMs are not installed within a pair, software should either disable the pair, or configure it to match the smaller sized DIMM. Any mixture of sizes is permitted among pairs. Software can identify the type and size of a DIMM in the system from its address range which is unique for each DIMM type and size. See TABLE 7-2 on page 63 or TABLE 7-4 on page 66 for the DIMM to PA mapping.
6.2.3 PCI Address Assignments
TABLE 6-2 Physical address space to PCI space
PA[40:0] | PCI Address Space | CPU Commands Supported | PCI Commands Generated
0x1FE.0100.0000 - 0x1FE.01FF.FFFF | PCI Configuration Space | NC read (any) | Configuration Read
 | | NC write (any) | Configuration Write (may also be Special Cycle)
0x1FE.0200.0000 - 0x1FE.02FF.FFFF | PCI Bus I/O Space | NC read (any) | I/O Read
 | | NC write (any) | I/O Write
0x1FE.0300.0000 - 0x1FE.FFFF.FFFF | Don’t Use | | May wrap to Configuration or I/O Space behavior
0x1FF.0000.0000 - 0x1FF.FFFF.FFFF | PCI Bus Memory Space | NC read (4 byte) | Memory Read
 | | NC read (8 byte) | Memory Read Multiple
 | | NC Block read | Memory Read Line
 | | NC write | Memory Write
 | | NC Block write | Memory Write
 | | NC Instruction fetch | Memory Read
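As an illustration of the PA-to-PCI-space assignment in TABLE 6-2, the following C sketch classifies a noncacheable physical address. The enum and function names are hypothetical; the address boundaries are taken directly from the table, and addresses outside the listed ranges (for example the internal CSR range below 0x1FE.0100.0000) are reported separately.

```c
#include <stdint.h>

typedef enum {
    PCI_SPACE_CONFIG,    /* 0x1FE.0100.0000 - 0x1FE.01FF.FFFF */
    PCI_SPACE_IO,        /* 0x1FE.0200.0000 - 0x1FE.02FF.FFFF */
    PCI_SPACE_DONT_USE,  /* 0x1FE.0300.0000 - 0x1FE.FFFF.FFFF */
    PCI_SPACE_MEMORY,    /* 0x1FF.0000.0000 - 0x1FF.FFFF.FFFF */
    PCI_SPACE_NONE       /* not listed in TABLE 6-2 */
} pci_space_t;

/* Classify a 41-bit noncacheable PA according to TABLE 6-2. */
static pci_space_t pci_space_of(uint64_t pa)
{
    if (pa >= UINT64_C(0x1FE01000000) && pa <= UINT64_C(0x1FE01FFFFFF))
        return PCI_SPACE_CONFIG;
    if (pa >= UINT64_C(0x1FE02000000) && pa <= UINT64_C(0x1FE02FFFFFF))
        return PCI_SPACE_IO;
    if (pa >= UINT64_C(0x1FE03000000) && pa <= UINT64_C(0x1FEFFFFFFFF))
        return PCI_SPACE_DONT_USE;
    if (pa >= UINT64_C(0x1FF00000000) && pa <= UINT64_C(0x1FFFFFFFFFF))
        return PCI_SPACE_MEMORY;
    return PCI_SPACE_NONE;
}
```

Boot code could use a check of this kind to reject accesses that would fall in the "Don't Use" range, where the generated PCI command is not well defined.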
TABLE 6-3 Additional Internal UltraSPARC-IIi CSR space (noncacheable)
PA[40:0] | Owner
0x1FE.0000.0000 - 0x1FE.0000.01FF | PBM
0x1FE.0000.0200 - 0x1FE.0000.03FF | IOM
0x1FE.0000.0400 - 0x1FE.0000.1FFF | PIE
0x1FE.0000.2000 - 0x1FE.0000.5FFF | PBM
0x1FE.0000.6000 - 0x1FE.0000.9FFF | PIE
0x1FE.0000.A000 - 0x1FE.0000.A7FF | IOM
0x1FE.0000.A800 - 0x1FE.0000.EFFF | PIE
0x1FE.0000.F000 - 0x1FE.00FF.F018 | MCU
0x1FE.00FF.F020 | PIE
0x1FE.0000.F028 - 0x1FE.00FF.FFFF | MCU
6.2.4 Probing the address space
Generally, systems are configurable, and the boot PROM needs to determine exactly what configuration is present. There are three address spaces to interrogate: DRAM, UPA64S, and PCI.
DRAM probing is explained in detail in Section A.10.2, “Memory Probing” on page 397.
Probing for PCI devices is done using PCI Configuration space accesses. To handle non-response to some of these accesses, software should synchronize on traps as described in Section 16.2.1, “Probing PCI during boot using deferred errors” on page 241. Also see Section 16.5, “Summary of Error Reporting” on page 249.
Unlike PCI, there is no trapping for non-reply to UPA64S transactions. If the motherboard ties the P_REPLY[1:0] (UPA64S acknowledgment signals) high during power-on reset, the MCU will assume it received a handshake for all loads and stores targeting the UPA64S address space. This allows software to detect a UPA64S device by looking for a specific known data pattern returned by the device. The MCU behavior prevents the software from hanging if no UPA64S device is present.
APB existence can be determined by probing APB-specific registers. See the APB specification for details. UltraSPARC-IIi does not support any UPA-compliant probing algorithm, other than as described.
6.3 Alternate Address Spaces
The SPARC-V9 Address Space Identifier (ASI) is divided into restricted and nonrestricted halves. ASIs in the range 00₁₆..7F₁₆ are restricted; ASIs in the range 80₁₆..FF₁₆ are non-restricted. An attempt by non-privileged software to access a restricted ASI causes a data_access_exception trap. ASIs in the ranges 04₁₆..11₁₆, 18₁₆..19₁₆, 24₁₆..2C₁₆, 70₁₆..73₁₆, 78₁₆..79₁₆, and 80₁₆..FF₁₆ are called “normal” or “translating” ASIs. These ASIs are translated by the MMU. Bypass ASIs are in the ranges 14₁₆..15₁₆ and 1C₁₆..1D₁₆. These ASIs are not translated by the MMU; instead, they pass through their virtual addresses as physical addresses.
UltraSPARC-IIi Internal ASIs (also called “Nontranslating ASIs”) are in the ranges 45₁₆..6F₁₆, 76₁₆..77₁₆, and 7E₁₆..7F₁₆. These ASIs are not translated by the MMU; instead, they pass through their virtual addresses as physical addresses. Accesses made using these ASIs are always made in “big-endian” mode, regardless of the setting of the D-MMU’s IE bit. Accesses to Internal ASIs with an invalid virtual address have undefined behavior: they may or may not cause a data_access_exception trap, and they may or may not alias onto a valid virtual address. Software should not rely on any specific behavior.
Note – MEMBAR #Sync is generally needed after stores to internal ASIs. A FLUSH, DONE, or RETRY is needed after stores to internal ASIs that affect instruction accesses. See Section 8.3.8, “Instruction Prefetch to Side-Effect Locations” on page 79.
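The ASI ranges given in this section can be summarized in a small C classifier. This is only a sketch of the ranges listed above; the type and function names (asi_class_t, asi_is_restricted, asi_classify) are hypothetical, and ASI values that fall in none of the listed ranges are reported as ASI_CLASS_OTHER.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    ASI_CLASS_TRANSLATING,  /* "normal" ASIs, translated by the MMU */
    ASI_CLASS_BYPASS,       /* VA passed through as PA */
    ASI_CLASS_INTERNAL,     /* nontranslating internal ASIs */
    ASI_CLASS_OTHER         /* not in any range listed in this section */
} asi_class_t;

static bool in_range(uint8_t v, uint8_t lo, uint8_t hi)
{
    return v >= lo && v <= hi;
}

/* ASIs 0x00..0x7F are restricted; 0x80..0xFF are nonrestricted. */
static bool asi_is_restricted(uint8_t asi)
{
    return asi <= 0x7F;
}

static asi_class_t asi_classify(uint8_t asi)
{
    if (in_range(asi, 0x04, 0x11) || in_range(asi, 0x18, 0x19) ||
        in_range(asi, 0x24, 0x2C) || in_range(asi, 0x70, 0x73) ||
        in_range(asi, 0x78, 0x79) || asi >= 0x80)
        return ASI_CLASS_TRANSLATING;
    if (in_range(asi, 0x14, 0x15) || in_range(asi, 0x1C, 0x1D))
        return ASI_CLASS_BYPASS;
    if (in_range(asi, 0x45, 0x6F) || in_range(asi, 0x76, 0x77) ||
        in_range(asi, 0x7E, 0x7F))
        return ASI_CLASS_INTERNAL;
    return ASI_CLASS_OTHER;
}
```

For instance, ASI 80₁₆ is nonrestricted and translating, ASI 14₁₆ is a restricted bypass ASI, and ASI 45₁₆ is a restricted internal ASI.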
6.3.1 Supported SPARC-V9 ASIs
The SPARC-V9 architecture defines several address spaces that must be supported by a conforming processor. They are listed in TABLE 6-4. All operand sizes are supported in these accesses. See Appendix G, “ASI Names” for an alphabetical listing of ASI names and macro syntax.
TABLE 6-4 Mandatory SPARC-V9 ASIs
ASI Value | ASI Name (Suggested Macro Syntax) | Access | Description | Section
04₁₆ | ASI_NUCLEUS (ASI_N) | RW | Implicit address space; nucleus privilege; TL > 0 | V9
0C₁₆ | ASI_NUCLEUS_LITTLE (ASI_NL) | RW | Implicit address space; nucleus privilege; TL > 0; little endian | V9
10₁₆ | ASI_AS_IF_USER_PRIMARY (ASI_AIUP) | RW² | Primary address space; user privilege | V9
11₁₆ | ASI_AS_IF_USER_SECONDARY (ASI_AIUS) | RW² | Secondary address space; user privilege | V9
18₁₆ | ASI_AS_IF_USER_PRIMARY_LITTLE (ASI_AIUPL) | RW² | Primary address space; user privilege; little endian | V9
19₁₆ | ASI_AS_IF_USER_SECONDARY_LITTLE (ASI_AIUSL) | RW² | Secondary address space; user privilege; little endian | V9
80₁₆ | ASI_PRIMARY (ASI_P) | RW | Implicit primary address space | V9
81₁₆ | ASI_SECONDARY (ASI_S) | RW | Implicit secondary address space | V9
82₁₆ | ASI_PRIMARY_NO_FAULT (ASI_PNF) | R¹ | Primary address space; no fault | V9, 14.4.6
83₁₆ | ASI_SECONDARY_NO_FAULT (ASI_SNF) | R¹ | Secondary address space; no fault | V9, 14.4.6
88₁₆ | ASI_PRIMARY_LITTLE (ASI_PL) | RW | Implicit primary address space; little endian | V9
89₁₆ | ASI_SECONDARY_LITTLE (ASI_SL) | RW | Implicit secondary address space; little endian | V9
8A₁₆ | ASI_PRIMARY_NO_FAULT_LITTLE (ASI_PNFL) | R¹ | Primary address space; no fault; little endian | V9, 14.4.6
8B₁₆ | ASI_SECONDARY_NO_FAULT_LITTLE (ASI_SNFL) | R¹ | Secondary address space; no fault; little endian | V9, 14.4.6
1. Read-only access; causes a data_access_exception trap if written.
2. Causes a data_access_exception trap if the page being accessed is privileged.
6.3.2 UltraSPARC-IIi (Non-SPARC-V9) ASI Extensions
TABLE 6-5 on page 42 defines all non-SPARC-V9 ASI extensions supported in UltraSPARC-IIi. These ASIs may be used with LDXA, STXA, LDDFA, STDFA instructions only, unless otherwise noted. Other length accesses will cause a data_access_exception trap. See Appendix G, “ASI Names” for an alphabetical listing of ASI names and macro syntax.
TABLE 6-5 UltraSPARC-IIi Extended (non-SPARC-V9) ASIs
ASI Value | ASI Name (Suggested Macro Syntax) | VA | Access | Description | Section
14₁₆ | ASI_PHYS_USE_EC (ASI_PHYS_USE_EC) | - | RW²,⁵ | Physical address; external cacheable only | 15.10
15₁₆ | ASI_PHYS_BYPASS_EC_WITH_EBIT (ASI_PHYS_BYPASS_EC_WITH_EBIT) | - | RW² | Physical address; noncacheable; with side effect | 15.10
1C₁₆ | ASI_PHYS_USE_EC_LITTLE (ASI_PHYS_USE_EC_L) | - | RW²,⁵ | Physical address; external cacheable only; little endian | 15.10
1D₁₆ | ASI_PHYS_BYPASS_EC_WITH_EBIT_LITTLE (ASI_PHYS_BYPASS_EC_WITH_EBIT_L) | - | RW² | Physical address; noncacheable; with side effect; little endian | 15.10
24₁₆ | ASI_NUCLEUS_QUAD_LDD (ASI_NUCLEUS_QUAD_LDD) | - | R¹,³ | Cacheable; 128-bit atomic LDDA | 13.6.1
2C₁₆ | ASI_NUCLEUS_QUAD_LDD_LITTLE (ASI_NUCLEUS_QUAD_LDD_L) | - | R¹,³ | Cacheable; 128-bit atomic LDDA; little endian | 13.6.1
45₁₆ | ASI_LSU_CONTROL_REG (ASI_LSU_CONTROL_REG) | 0₁₆ | RW | Load/store unit control register | A.6
46₁₆ | ASI_DCACHE_DATA (ASI_DCACHE_DATA) | - | RW | D-cache data RAM diagnostics access | A.8.1
47₁₆ | ASI_DCACHE_TAG (ASI_DCACHE_TAG) | - | RW | D-cache tag/valid RAM diagnostics access | A.8.2
48₁₆ | ASI_INTR_DISPATCH_STATUS (ASI_INTR_DISPATCH_STATUS) | 0₁₆ | R¹ | Interrupt vector dispatch status | 11.10.3
49₁₆ | ASI_INTR_RECEIVE (ASI_INTR_RECEIVE) | 0₁₆ | RW | Interrupt vector receive status | 11.10.5
4A₁₆ | ASI_UPA_CONFIG_REG (ASI_UPA_CONFIG_REG) | 0₁₆ | RW | UPA configuration register | 18.5
4B₁₆ | ASI_ESTATE_ERROR_EN_REG (ASI_ESTATE_ERROR_EN_REG) | 0₁₆ | RW | E-cache error enable register | 16.6.1
4C₁₆ | ASI_ASYNC_FAULT_STATUS (ASI_ASYNC_FAULT_STATUS) | 0₁₆ | RW | ECU Asynchronous fault status register | 16.6.2
4D₁₆ | ASI_ASYNC_FAULT_ADDRESS (ASI_ASYNC_FAULT_ADDRESS) | 0₁₆ | RW | ECU Asynchronous fault address register | 16.6.3
4E₁₆ | ASI_ECACHE_TAG_DATA (ASI_EC_TAG_DATA) | 0₁₆ | RW | E-cache tag/valid RAM data diagnostic access | A.9.2
50₁₆ | ASI_IMMU (ASI_IMMU) | 0₁₆ | R¹ | I-MMU Tag Target Register | 15.9.2
50₁₆ | ASI_IMMU (ASI_IMMU) | 18₁₆ | RW | I-MMU Synchronous Fault Status Register | 15.9.4
50₁₆ | ASI_IMMU (ASI_IMMU) | 28₁₆ | RW | I-MMU TSB Register | 15.9.6
50₁₆ | ASI_IMMU (ASI_IMMU) | 30₁₆ | RW | I-MMU TLB Tag Access Register | 15.9.7
TABLE 6-5 UltraSPARC-IIi Extended (non-SPARC-V9) ASIs (Continued)
ASI Value | ASI Name (Suggested Macro Syntax) | VA | Access | Description | Section
51₁₆ | ASI_IMMU_TSB_8KB_PTR_REG (ASI_IMMU_TSB_8KB_PTR_REG) | 0₁₆ | R¹ | I-MMU TSB 8KB Pointer Register | 15.9.8
52₁₆ | ASI_IMMU_TSB_64KB_PTR_REG (ASI_IMMU_TSB_64KB_PTR_REG) | 0₁₆ | R¹ | I-MMU TSB 64KB Pointer Register | 15.9.8
54₁₆ | ASI_ITLB_DATA_IN_REG (ASI_ITLB_DATA_IN_REG) | 0₁₆ | W¹ | I-MMU TLB Data In Register | 15.9.9
55₁₆ | ASI_ITLB_DATA_ACCESS_REG (ASI_ITLB_DATA_ACCESS_REG) | 0₁₆..1F8₁₆ | RW | I-MMU TLB Data Access Register | 15.9.9
56₁₆ | ASI_ITLB_TAG_READ_REG (ASI_ITLB_TAG_READ_REG) | 0₁₆..1F8₁₆ | R¹ | I-MMU TLB Tag Read Register | 15.9.9
57₁₆ | ASI_IMMU_DEMAP (ASI_IMMU_DEMAP) | 0₁₆ | W¹ | I-MMU TLB demap | 15.9.10
58₁₆ | ASI_DMMU (ASI_DMMU) | 0₁₆ | R¹ | D-MMU Tag Target Register | 15.9.2
58₁₆ | ASI_DMMU (ASI_DMMU) | 8₁₆ | RW | I/D MMU Primary Context Register | 15.9.3
58₁₆ | ASI_DMMU (ASI_DMMU) | 10₁₆ | RW | D-MMU Secondary Context Register | 15.9.3
58₁₆ | ASI_DMMU (ASI_DMMU) | 18₁₆ | RW | D-MMU Synch. Fault Status Register | 15.9.4
58₁₆ | ASI_DMMU (ASI_DMMU) | 20₁₆ | R¹ | D-MMU Synch. Fault Address Register | 15.9.5
58₁₆ | ASI_DMMU (ASI_DMMU) | 28₁₆ | RW | D-MMU TSB Register | 15.9.6
58₁₆ | ASI_DMMU (ASI_DMMU) | 30₁₆ | RW | D-MMU TLB Tag Access Register | 15.9.7
58₁₆ | ASI_DMMU (ASI_DMMU) | 38₁₆ | RW | D-MMU VA Data Watchpoint Register | A.5.3
58₁₆ | ASI_DMMU (ASI_DMMU) | 40₁₆ | RW | D-MMU PA Data Watchpoint Register | A.5.4
59₁₆ | ASI_DMMU_TSB_8KB_PTR_REG (ASI_DMMU_TSB_8KB_PTR_REG) | 0₁₆ | R¹ | D-MMU TSB 8K Pointer Register | 15.9.8
5A₁₆ | ASI_DMMU_TSB_64KB_PTR_REG (ASI_DMMU_TSB_64KB_PTR_REG) | 0₁₆ | R¹ | D-MMU TSB 64K Pointer Register | 15.9.8
5B₁₆ | ASI_DMMU_TSB_DIRECT_PTR_REG (ASI_DMMU_TSB_DIRECT_PTR_REG) | 0₁₆ | R¹ | D-MMU TSB Direct Pointer Register | 15.9.8
5C₁₆ | ASI_DTLB_DATA_IN_REG (ASI_DTLB_DATA_IN_REG) | 0₁₆ | W¹ | D-MMU TLB Data In Register | 15.9.9
5D₁₆ | ASI_DTLB_DATA_ACCESS_REG (ASI_DTLB_DATA_ACCESS_REG) | 0₁₆..1F8₁₆ | RW | D-MMU TLB Data Access Register | 15.9.9
5E₁₆ | ASI_DTLB_TAG_READ_REG (ASI_DTLB_TAG_READ_REG) | 0₁₆..1F8₁₆ | R¹ | D-MMU TLB Tag Read Register | 15.9.9
TABLE 6-5 UltraSPARC-IIi Extended (non-SPARC-V9) ASIs (Continued)
ASI Value | ASI Name (Suggested Macro Syntax) | VA | Access | Description | Section
5F₁₆ | ASI_DMMU_DEMAP (ASI_DMMU_DEMAP) | 0₁₆ | W¹ | DMMU TLB demap | 15.9.10
66₁₆ | ASI_ICACHE_INSTR (ASI_IC_INSTR) | - | RW³ | I-cache instruction RAM diagnostic access | A.7.1
67₁₆ | ASI_ICACHE_TAG (ASI_IC_TAG) | - | RW³ | I-cache tag/valid RAM diagnostic access | A.7.2
6E₁₆ | ASI_ICACHE_PRE_DECODE (ASI_IC_PRE_DECODE) | - | RW³ | I-cache pre-decode RAM diagnostics access | A.7.3
6F₁₆ | ASI_ICACHE_NEXT_FIELD (ASI_IC_NEXT_FIELD) | - | RW³ | I-cache next-field RAM diagnostics access | A.7.4
70₁₆ | ASI_BLOCK_AS_IF_USER_PRIMARY (ASI_BLK_AIUP) | - | RW⁴,⁶ | Primary address space; block load/store; user privilege | 13.5.3
71₁₆ | ASI_BLOCK_AS_IF_USER_SECONDARY (ASI_BLK_AIUS) | - | RW⁴,⁶ | Secondary address space; block load/store; user privilege | 13.5.3
76₁₆ | ASI_ECACHE_W (ASI_EC_W) | =1 | W¹ | E-cache data RAM diagnostic write access | A.9.1
76₁₆ | ASI_ECACHE_W (ASI_EC_W) | =2 | W¹ | E-cache tag/valid RAM diagnostic write access | A.9.2
77₁₆ | ASI_SDBH_ERROR_REG_WRITE (ASI_SDB_ERROR_W) | 0₁₆ | W¹ | External UDB Error Register; write high | 16.6.4
77₁₆ | ASI_SDBL_ERROR_REG_WRITE (ASI_SDB_ERROR_W) | 18₁₆ | W¹ | External UDB Error Register; write low | 16.6.5
77₁₆ | ASI_SDBH_CONTROL_REG_WRITE (ASI_SDB_CONTROL_W) | 20₁₆ | W¹ | External UDB Control Register; write high | 16.6.6
77₁₆ | ASI_SDBL_CONTROL_REG_WRITE (ASI_SDB_CONTROL_W) | 38₁₆ | W¹ | External UDB Control Register; write low | 16.6.7
77₁₆ | ASI_SDB_INTR_W (ASI_SDB_INTR_W) | =MID, =70₁₆ | W¹ | Interrupt vector dispatch | 11.10.2
77₁₆ | ASI_SDB_INTR_W (ASI_SDB_INTR_W) | 40₁₆ | W¹ | Outgoing interrupt vector data register 0 | 11.10.1
77₁₆ | ASI_SDB_INTR_W (ASI_SDB_INTR_W) | 50₁₆ | W¹ | Outgoing interrupt vector data register 1 | 11.10.1
77₁₆ | ASI_SDB_INTR_W (ASI_SDB_INTR_W) | 60₁₆ | W¹ | Outgoing interrupt vector data register 2 | 11.10.1
78₁₆ | ASI_BLOCK_AS_IF_USER_PRIMARY_LITTLE (ASI_BLK_AIUPL) | - | RW⁴ | Primary address space; block load/store; user privilege; little endian | 13.5.3
TABLE 6-5 UltraSPARC-IIi Extended (non-SPARC-V9) ASIs (Continued)
ASI Value | ASI Name (Suggested Macro Syntax) | VA | Access | Description | Section
79₁₆ | ASI_BLOCK_AS_IF_USER_SECONDARY_LITTLE (ASI_BLK_AIUSL) | - | RW⁴ | Secondary address space; block load/store; user privilege; little endian | 13.5.3
7E₁₆ | ASI_ECACHE_R (ASI_EC_R) | =1 | R¹ | E-cache data RAM diagnostic read access | A.8.1
7E₁₆ | ASI_ECACHE_R (ASI_EC_R) | =2 | R¹ | E-cache tag/valid RAM diagnostic read access | A.8.2
7F₁₆ | ASI_SDBH_ERROR_REG_READ (ASI_SDBH_ERROR_R) | 0₁₆ | R¹ | External SDB Error Register; read high | 16.6.4
7F₁₆ | ASI_SDBL_ERROR_REG_READ (ASI_SDBL_ERROR_R) | 18₁₆ | R¹ | External SDB Error Register; read low | 16.6.5
7F₁₆ | ASI_SDBH_CONTROL_REG_READ (ASI_SDBH_CONTROL_R) | 20₁₆ | R¹ | External SDB Control Register; read high | 16.6.6
7F₁₆ | ASI_SDBL_CONTROL_REG_READ (ASI_SDBL_CONTROL_R) | 38₁₆ | R¹ | External SDB Control Register; read low | 16.6.7
7F₁₆ | ASI_SDB_INTR_R | 40₁₆ | R¹ | Incoming interrupt vector data register 0 | 11.10.4
7F₁₆ | ASI_SDB_INTR_R | 50₁₆ | R¹ | Incoming interrupt vector data register 1 | 11.10.4
7F₁₆ | ASI_SDB_INTR_R | 60₁₆ | R¹ | Incoming interrupt vector data register 2 | 11.10.4
7F₁₆ | ASI_INT_ACK | - | R | PCI interrupt acknowledge register | 9.2.6
C0₁₆ | ASI_PST8_PRIMARY (ASI_PST8_P) | - | W¹,⁴ | Primary address space, 8 8-bit partial store | 13.5.1
C1₁₆ | ASI_PST8_SECONDARY (ASI_PST8_S) | - | W¹,⁴ | Secondary address space, 8 8-bit partial store | 13.5.1
C2₁₆ | ASI_PST16_PRIMARY (ASI_PST16_P) | - | W¹,⁴ | Primary address space, 4 16-bit partial store | 13.5.1
C3₁₆ | ASI_PST16_SECONDARY (ASI_PST16_S) | - | W¹,⁴ | Secondary address space, 4 16-bit partial store | 13.5.1
C4₁₆ | ASI_PST32_PRIMARY (ASI_PST32_P) | - | W¹,⁴ | Primary address space, 2 32-bit partial store | 13.5.1
C5₁₆ | ASI_PST32_SECONDARY (ASI_PST32_S) | - | W¹,⁴ | Secondary address space, 2 32-bit partial store | 13.5.1
C8₁₆ | ASI_PST8_PRIMARY_LITTLE (ASI_PST8_PL) | - | W¹,⁴ | Primary address space, 8 8-bit partial store, little endian | 13.5.1
C9₁₆ | ASI_PST8_SECONDARY_LITTLE (ASI_PST8_SL) | - | W¹,⁴ | Secondary address space, 8 8-bit partial store, little endian | 13.5.1
TABLE 6-5 UltraSPARC-IIi Extended (non-SPARC-V9) ASIs (Continued)
ASI Value | ASI Name (Suggested Macro Syntax) | VA | Access | Description | Section
CA₁₆ | ASI_PST16_PRIMARY_LITTLE (ASI_PST16_PL) | - | W¹,⁴ | Primary address space, 4 16-bit partial store, little endian | 13.5.1
CB₁₆ | ASI_PST16_SECONDARY_LITTLE (ASI_PST16_SL) | - | W¹,⁴ | Secondary address space, 4 16-bit partial store, little endian | 13.5.1
CC₁₆ | ASI_PST32_PRIMARY_LITTLE (ASI_PST32_PL) | - | W¹,⁴ | Primary address space, 2 32-bit partial store, little endian | 13.5.1
CD₁₆ | ASI_PST32_SECONDARY_LITTLE (ASI_PST32_SL) | - | W¹,⁴ | Secondary address space, 2 32-bit partial store, little endian | 13.5.1
D0₁₆ | ASI_FL8_PRIMARY (ASI_FL8_P) | - | RW⁴ | Primary address space, one 8-bit floating point load/store | 13.5.2
D1₁₆ | ASI_FL8_SECONDARY (ASI_FL8_S) | - | RW⁴ | Secondary address space, one 8-bit floating point load/store | 13.5.2
D2₁₆ | ASI_FL16_PRIMARY (ASI_FL16_P) | - | RW⁴ | Primary address space, one 16-bit floating point load/store | 13.5.2
D3₁₆ | ASI_FL16_SECONDARY (ASI_FL16_S) | - | RW⁴ | Secondary address space, one 16-bit floating point load/store | 13.5.2
D8₁₆ | ASI_FL8_PRIMARY_LITTLE (ASI_FL8_PL) | - | RW⁴ | Primary address space, one 8-bit floating point load/store, little endian | 13.5.2
D9₁₆ | ASI_FL8_SECONDARY_LITTLE (ASI_FL8_SL) | - | RW⁴ | Secondary address space, one 8-bit floating point load/store, little endian | 13.5.2
DA₁₆ | ASI_FL16_PRIMARY_LITTLE (ASI_FL16_PL) | - | RW⁴ | Primary address space, one 16-bit floating point load/store, little endian | 13.5.2
DB₁₆ | ASI_FL16_SECONDARY_LITTLE (ASI_FL16_SL) | - | RW⁴ | Secondary address space, one 16-bit floating point load/store, little endian | 13.5.2
E0₁₆ | ASI_BLK_COMMIT_PRIMARY (ASI_BLK_COMMIT_P) | - | W¹,⁴ | Primary address space; block store commit operation | 13.5.3
E1₁₆ | ASI_BLK_COMMIT_SECONDARY (ASI_BLK_COMMIT_S) | - | W¹,⁴ | Secondary address space; block store commit operation | 13.5.3
F0₁₆ | ASI_BLOCK_PRIMARY (ASI_BLK_P) | - | RW⁴ | Primary address space; block load/store | 13.5.3
TABLE 6-5 UltraSPARC-IIi Extended (non-SPARC-V9) ASIs (Continued)
ASI Value | ASI Name (Suggested Macro Syntax) | VA | Access | Description | Section
F1₁₆ | ASI_BLOCK_SECONDARY (ASI_BLK_S) | - | RW⁴ | Secondary address space; block load/store | 13.5.3
F8₁₆ | ASI_BLOCK_PRIMARY_LITTLE (ASI_BLK_PL) | - | RW⁴ | Primary address space; block load/store; little endian | 13.5.3
F9₁₆ | ASI_BLOCK_SECONDARY_LITTLE (ASI_BLK_SL) | - | RW⁴ | Secondary address space; block load/store; little endian | 13.5.3
1. Read-/write-only accesses cause a data_access_exception trap if written/read respectively.
2. 8-/16-/32-/64-bit accesses allowed.
3. LDDA, STDFA or STXA only. Other types of access cause a data_access_exception trap.
4. LDDFA/STDFA only. Other types of access cause a data_access_exception trap.
5. Can be used with LDSTUBA, SWAPA, CAS(X)A.
6. Causes a data_access_exception trap if the page being accessed is privileged.
6.4 Summary of CSRs mapped to the Noncacheable address space
TABLE 6-6 CSRs Mapped to Non-cacheable Address Space
PA | Register | Access Size | Section
0x1FE.0000.0000 | Undefined (alias to other csrs); was UPA PortID | 8 bytes |
0x1FE.0000.0008 | Undefined (alias to other csrs); was UPA Config | 8 bytes |
0x1FE.0000.0010 | Reserved | 8 bytes |
0x1FE.0000.0020 | Reserved | 8 bytes |
0x1FE.0000.0030 | DMA UE AFSR | 8 bytes | 19.4.3.1
0x1FE.0000.0038 | DMA UE/CE AFAR | 8 bytes | 19.4.3.2
0x1FE.0000.0040 | DMA CE AFSR | 8 bytes | 19.4.3.3
0x1FE.0000.0048 | DMA UE/CE AFAR (aliases to 0x1fe.0000.0038) | 8 bytes | 19.4.3.2
0x1FE.0000.0100 | Reserved | 8 bytes |
0x1FE.0000.0108 | Reserved | 8 bytes |
0x1FE.0000.0200 | IOMMU Control Register | 8 bytes | 19.3.2.1
0x1FE.0000.0208 | IOMMU TSB Base Address Reg | 8 bytes | 19.3.2.2
0x1FE.0000.0210 | IOMMU Flush Register | 8 bytes | 19.3.2.3
0x1FE.0000.0C00 | PCI Bus A Slot 0 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.0C08 | PCI Bus A Slot 1 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.0C10 | PCI Bus A Slot 2 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.0C18 | PCI Bus A Slot 3 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.0C20 | PCI Bus B Slot 0 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.0C28 | PCI Bus B Slot 1 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.0C30 | PCI Bus B Slot 2 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.0C38 | PCI Bus B Slot 3 Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1000 | SCSI Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1008 | Ethernet Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1010 | Parallel Port Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1018 | Audio Record Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1020 | Audio Playback Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1028 | Power Fail Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1030 | Kbd/mouse/serial Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1038 | Floppy Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1040 | Spare HW Int Mapping Reg | 8 bytes | 19.3.3.1
TABLE 6-6 CSRs Mapped to Non-cacheable Address Space (Continued)
PA | Register | Access Size | Section
0x1FE.0000.1048 | Keyboard Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1050 | Mouse Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1058 | Serial Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1060 | Reserved | 8 bytes | 19.3.3.1
0x1FE.0000.1068 | Reserved | 8 bytes | 19.3.3.1
0x1FE.0000.1070 | DMA UE Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1078 | DMA CE Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1080 | PCI Error Int Mapping Reg | 8 bytes | 19.3.3.1
0x1FE.0000.1088 | Reserved | 8 bytes |
0x1FE.0000.1090 | Reserved | 8 bytes |
0x1FE.0000.1098 | On board graphics Int Mapping Reg (also mapped at 0x1FE.0000.6000) | 8 bytes | 19.3.3.2
0x1FE.0000.10A0 | Expansion UPA64S Int Mapping Reg (also mapped at 0x1FE.0000.8000) | 8 bytes | 19.3.3.2
0x1FE.0000.1400 - 0x1FE.0000.1418 | PCI Bus A Slot 0 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.1420 - 0x1FE.0000.1438 | PCI Bus A Slot 1 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.1440 - 0x1FE.0000.1458 | PCI Bus A Slot 2 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.1460 - 0x1FE.0000.1478 | PCI Bus A Slot 3 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.1480 - 0x1FE.0000.1498 | PCI Bus B Slot 0 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.14A0 - 0x1FE.0000.14B8 | PCI Bus B Slot 1 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.14C0 - 0x1FE.0000.14D8 | PCI Bus B Slot 2 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.14E0 - 0x1FE.0000.14F8 | PCI Bus B Slot 3 Clear Int Regs | 8 bytes | 19.3.3.3
0x1FE.0000.1800 | SCSI Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1808 | Ethernet Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1810 | Parallel Port Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1818 | Audio Record Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1820 | Audio Playback Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1828 | Power Fail Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1830 | Kbd/mouse/serial Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1838 | Floppy Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1840 | Spare HW Clear Int Reg | 8 bytes | 19.3.3.3
TABLE 6-6 CSRs Mapped to Non-cacheable Address Space (Continued)
PA | Register | Access Size | Section
0x1FE.0000.1848 | Keyboard Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1850 | Mouse Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1858 | Serial Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1860 | Reserved | 8 bytes | 19.3.3.3
0x1FE.0000.1868 | Reserved | 8 bytes | 19.3.3.3
0x1FE.0000.1870 | DMA UE Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1878 | DMA CE Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1880 | PCI Error Clear Int Reg | 8 bytes | 19.3.3.3
0x1FE.0000.1888 | Reserved | 8 bytes |
0x1FE.0000.1890 | Reserved | 8 bytes |
0x1FE.0000.1A00 | Reserved | 8 bytes |
0x1FE.0000.1C00 | Reserved | 8 bytes |
0x1FE.0000.1C08 | Reserved | 8 bytes |
0x1FE.0000.1C10 | Reserved | 8 bytes |
0x1FE.0000.1C18 | Reserved | 8 bytes |
0x1FE.0000.1C20 | PCI DMA Write Synchronization Register | 8 bytes | 19.3.0.5
0x1FE.0000.2000 | PCI Control/Status Register | 8 bytes | 19.3.0.1
0x1FE.0000.2010 | PCI PIO Write AFSR | 8 bytes | 19.3.0.2
0x1FE.0000.2018 | PCI PIO Write AFAR | 8 bytes | 19.3.0.2
0x1FE.0000.2020 | PCI Diagnostic Register | 8 bytes | 19.3.0.3
0x1FE.0000.2028 | PCI Target Address Space Register | 8 bytes | 19.3.0.4
0x1FE.0000.2800 | Reserved | 8 bytes |
0x1FE.0000.2808 | Reserved | 8 bytes |
0x1FE.0000.2810 | Reserved | 8 bytes |
0x1FE.0000.4800 | Reserved | 8 bytes |
0x1FE.0000.4808 | Reserved | 8 bytes |
0x1FE.0000.4810 | Reserved | 8 bytes |
0x1FE.0000.5000 - 0x1FE.0000.5038 | PIO Buffer Diag Access | 8 bytes | 19.3.0.6
0x1FE.0000.5100 - 0x1FE.0000.5138 | DMA Buffer Diag Access | 8 bytes | 19.3.0.7
0x1FE.0000.51C0 | DMA Buffer Diag Access (72:64) | 8 bytes | 19.3.0.8
0x1FE.0000.6000 | On board graphics Int Mapping Reg (also mapped at 0x1FE.0000.1098) | 8 bytes | 19.3.3.2
0x1FE.0000.8000 | Expansion UPA64S Int Mapping Reg (also mapped at 0x1FE.0000.10A0) | 8 bytes | 19.3.3.2
0x1FE.0000.A000 | Reserved | 8 bytes |
TABLE 6-6 CSRs Mapped to Non-cacheable Address Space (Continued)
PA | Register | Access Size | Section
0x1FE.0000.A008 | Reserved | 8 bytes |
0x1FE.0000.A400 | IOMMU Virtual Address Diag Reg | 8 bytes | 19.3.2.4
0x1FE.0000.A408 | IOMMU Tag Compare Diag | 8 bytes | 19.3.2.5
0x1FE.0000.A500 - 0x1FE.0000.A57F | Reserved | 8 bytes |
0x1FE.0000.A580 - 0x1FE.0000.A5FF | IOMMU Tag Diag | 8 bytes | 19.3.2.6
0x1FE.0000.A600 - 0x1FE.0000.A67F | IOMMU Data RAM Diag | 8 bytes | 19.3.2.7
0x1FE.0000.A800 | PCI Int State Diag Reg | 8 bytes | 19.3.3.4
0x1FE.0000.A808 | OBIO and Misc Int State Diag Reg | 8 bytes |
0x1FE.0000.B000 - 0x1FE.0000.B3FF | Reserved | 8 bytes |
0x1FE.0000.B400 - 0x1FE.0000.B7FF | Reserved | 8 bytes |
0x1FE.0000.B800 - 0x1FE.0000.B87F | Reserved | 8 bytes |
0x1FE.0000.B900 - 0x1FE.0000.B97F | Reserved | 8 bytes |
0x1FE.0000.C000 - 0x1FE.0000.C3FF | Reserved | 8 bytes |
0x1FE.0000.C400 - 0x1FE.0000.C7FF | Reserved | 8 bytes |
0x1FE.0000.C800 - 0x1FE.0000.C87F | Reserved | 8 bytes |
0x1FE.0000.C900 - 0x1FE.0000.C97F | Reserved | 8 bytes |
0x1FE.0000.F000 | FFB_Config | 8 bytes |
0x1FE.0000.F010 | MC_Control0 | 8 bytes |
0x1FE.0000.F018 | MC_Control1 | 8 bytes |
0x1FE.0000.F020 | Reset_Control | 8 bytes |
0x1FE.0100.0000 | PCI Configuration Space: Vendor ID | 2 bytes | 19.3.1.1
0x1FE.0100.0002 | PCI Configuration Space: Device ID | 2 bytes | 19.3.1.2
0x1FE.0100.0004 | PCI Configuration Space: Command | 2 bytes | 19.3.1.3
0x1FE.0100.0006 | PCI Configuration Space: Status | 2 bytes | 19.3.1.4
0x1FE.0100.0008 | PCI Configuration Space: Revision ID | 2 bytes | 19.3.1.5
0x1FE.0100.0009 | PCI Configuration Space: Programming I/F Code | 1 byte | 19.3.1.6
0x1FE.0100.000A | PCI Configuration Space: Sub-class Code | 1 byte | 19.3.1.7
0x1FE.0100.000B | PCI Configuration Space: Base Class Code | 1 byte | 19.3.1.8
TABLE 6-6 CSRs Mapped to Non-cacheable Address Space (Continued)
PA | Register | Access Size | Section
0x1FE.0100.000D | PCI Configuration Space: Latency Timer | 1 byte | 19.3.1.9
0x1FE.0100.000E | PCI Configuration Space: Header Type | 1 byte | 19.3.1.10
0x1FE.0100.0040 | PCI Configuration Space: Bus Number | 1 byte | 19.3.1.11
0x1FE.0100.0041 | PCI Configuration Space: Subordinate Bus Number | 1 byte | 19.3.1.11
0x1FE.0100.0042 - 0x1FE.0100.07FF | Reserved | Any |
0x1FE.0200.0000 - 0x1FE.02FF.FFFF | PCI Bus I/O Space | Any |
0x1FF.0000.0000 - 0x1FF.FFFF.FFFF | PCI Bus Memory Space | Any |
Compatibility Note – A read of any address labelled “Reserved” above returns zeros, and writes have no effect.
Caution – Reads to noncacheable addresses not listed above may return zeroes or alias an existing CSR in the table. Writes to noncacheable addresses not listed above may result in a no-op or invoke an alias to an existing CSR in the table and modify it unexpectedly. Software should protect addresses over the full range of 0x1FE.0000.0000 through 0x1FE.00FF.FFFF to prevent back-door access.
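One way to apply the caution above is for privileged software to check whether a requested mapping or access overlaps the protected CSR range before allowing it. The following C sketch shows such a check; the function name touches_csr_range and the (pa, len) calling convention are assumptions made for illustration, while the range limits are the ones stated in the caution.

```c
#include <stdbool.h>
#include <stdint.h>

#define CSR_FIRST UINT64_C(0x1FE00000000)  /* 0x1FE.0000.0000 */
#define CSR_LAST  UINT64_C(0x1FE00FFFFFF)  /* 0x1FE.00FF.FFFF */

/* Returns true if the byte range [pa, pa + len) overlaps the protected
 * internal CSR range. A zero-length range overlaps nothing. */
static bool touches_csr_range(uint64_t pa, uint64_t len)
{
    if (len == 0)
        return false;

    uint64_t last = pa + (len - 1);
    if (last < pa)              /* clamp if the range wraps past 2^64 */
        last = UINT64_MAX;

    return pa <= CSR_LAST && last >= CSR_FIRST;
}
```

A mapping request for which touches_csr_range returns true would then be refused or restricted to privileged code.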
6.5 Ancillary State Registers
6.5.1 Overview of ASRs
SPARC-V9 provides up to 32 Ancillary State Registers (ASRs 0 .. 31). ASRs 0 .. 6 are defined by the SPARC-V9 ISA; ASRs 7 .. 15 are reserved for future use by the architecture. ASRs 16 .. 31 are available for use by an implementation.
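The partitioning of the ASR number space described above can be written as a small C helper. The names asr_class_t and asr_classify are illustrative only; the boundaries are those given in the text.

```c
typedef enum {
    ASR_V9_DEFINED,       /* ASRs 0..6: defined by the SPARC-V9 ISA */
    ASR_V9_RESERVED,      /* ASRs 7..15: reserved for the architecture */
    ASR_IMPLEMENTATION,   /* ASRs 16..31: implementation-defined */
    ASR_INVALID           /* outside the 0..31 range */
} asr_class_t;

static asr_class_t asr_classify(unsigned asr)
{
    if (asr <= 6)
        return ASR_V9_DEFINED;
    if (asr <= 15)
        return ASR_V9_RESERVED;
    if (asr <= 31)
        return ASR_IMPLEMENTATION;
    return ASR_INVALID;
}
```

For example, asr_classify(0x10) returns ASR_IMPLEMENTATION, since ASR 16 is the first register available to an implementation.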
6.5.2 SPARC-V9-Defined ASRs
TABLE 6-7 defines the SPARC-V9 ASRs that must be supported by a conforming processor implementation. TABLE 6-8 suggests the assembly language syntax for accessing these registers.
TABLE 6-7 Mandatory SPARC-V9 ASRs
ASR Value | ASR Name | Access | Description | Section
00₁₆ | Y_REG | RW | Y register | V9
02₁₆ | COND_CODE_REG | RW | Condition code register | V9
03₁₆ | ASI_REG | RW | ASI register | V9
04₁₆ | TICK_REG | R¹,² | TICK register | V9
05₁₆ | PC | R² | Program Counter | V9
06₁₆ | FP_STATUS_REG | RW | Floating-point status register | V9
1. An attempt to read this register by non-privileged software with NPT = 1 causes a privileged_action trap. The tick register can only be written with the privileged wrpr instruction.
2. Read-only; an attempt to write this register causes an illegal_instruction trap.
TABLE 6-8 Suggested Assembler Syntax for Mandatory ASRs
Operation | Syntax
rd | %y, reg_rd
wr | reg_rs1, reg_or_imm, %y
rd | %ccr, reg_rd
wr | reg_rs1, reg_or_imm, %ccr
rd | %asi, reg_rd
wr | reg_rs1, reg_or_imm, %asi
rd | %tick, reg_rd
rd | %pc, reg_rd
rd | %fprs, reg_rd
wr | reg_rs1, reg_or_imm, %fprs
6.5.3 Non-SPARC-V9 ASRs
Non-SPARC-V9 ASRs are listed in TABLE 6-9.
TABLE 6-9 Non-SPARC-V9 ASRs
ASR Value | ASR Name/Syntax | Access | Description | Section
10₁₆ | PERF_CONTROL_REG | RW³ | Performance Control Reg (PCR) | B.2
11₁₆ | PERF_COUNTER | RW⁴ | Performance Instrumentation Counters (PIC) | B.4
12₁₆ | DISPATCH_CONTROL_REG | RW³ | Dispatch Control Register (DCR) | A.3
13₁₆ | GRAPHIC_STATUS_REG | RW² | Graphics Status Register (GSR) | 13.3
14₁₆ | SET_SOFTINT | W¹ | Set bit(s) in per-processor Soft Interrupt register | 11.11
15₁₆ | CLEAR_SOFTINT | W¹ | Clear bit(s) in per-processor Soft Interrupt register | 11.11
16₁₆ | SOFTINT_REG | RW³ | Per-processor Soft Interrupt register | 11.11
17₁₆ | TICK_CMPR_REG | RW³ | TICK compare register | 14.5.1
1. Read accesses cause an illegal_instruction trap. Nonprivileged write accesses cause a privileged_opcode trap.
2. Accesses cause an fp_disabled trap if PSTATE.PEF or FPRS.FEF are zero.
3. Nonprivileged accesses cause a privileged_opcode trap.
4. Nonprivileged accesses with PCR.PRIV=0 cause a privileged_action trap.
TABLE 6-10 Suggested Assembler Syntax for Non-SPARC V9 ASRs
Operation | Syntax
rd | %pcr, reg_rd
wr | reg_rs1, %pcr
rd | %pic, reg_rd
wr | reg_rs1, %pic
rd | %gsr, reg_rd
wr | reg_rs1, %gsr
wr | reg_rs1, %clear_softint
wr | reg_rs1, %set_softint
rd | %softint, reg_rd
wr | reg_rs1, %softint
rd | %tick_cmpr, reg_rd
wr | reg_rs1, %tick_cmpr
rd | %dcr, reg_rd
wr | reg_rs1, %dcr
6.6 Other UltraSPARC-IIi Registers
TABLE 6-11 lists additional sets of 64-bit global registers supported by UltraSPARC-IIi.
TABLE 6-11 Other UltraSPARC-IIi Registers
Register Name | Access | Description | Section
INTERRUPT_GLOBAL_REG | RW | 8 Interrupt handler globals | 14.5.9
MMU_GLOBAL_REG | RW | 8 MMU handler globals | 14.5.9
6.7
Supported Traps
TABLE 6-12 lists the traps supported by UltraSPARC-IIi.
TABLE 6-12 Traps Supported in UltraSPARC-IIi

Exception or Interrupt Request       Globals (9)   TT               Priority
Reserved                             -             0x000            n/a
power_on_reset                       AG            0x001            0
watchdog_reset                       AG            0x002            1 (1)
externally_initiated_reset           AG            0x003            1 (1)
software_initiated_reset             AG            0x004            1 (1)
RED_state_exception                  AG            0x005            1 (1)
instruction_access_exception         MG            0x008            5
instruction_access_error             AG            0x00A            3
illegal_instruction                  AG            0x010            7 (10)
privileged_opcode                    AG            0x011            6
fp_disabled                          AG            0x020            8
fp_exception_ieee_754                AG            0x021            11 (2)
fp_exception_other                   AG            0x022            11 (2)
tag_overflow                         AG            0x023            14
clean_window                         AG            0x024..0x027     10
division_by_zero                     AG            0x028            15
data_access_exception                MG            0x030            12 (3)
data_access_error                    AG            0x032            12 (3)
mem_address_not_aligned              AG            0x034            10 (4, 10)
LDDF_mem_address_not_aligned         AG            0x035            10 (4)
STDF_mem_address_not_aligned         AG            0x036            10 (4)
privileged_action                    AG            0x037            11 (2)
interrupt_level_n (n = 1 .. 15)      AG            0x041..0x04F     32-n
interrupt_vector                     IG            0x060            16 (5)
PA_watchpoint                        AG            0x061            12 (5)
VA_watchpoint                        AG            0x062            11 (2)
corrected_ECC_error                  AG            0x063            33
fast_instruction_access_MMU_miss     MG            0x064..0x067     2 (6)
fast_data_access_MMU_miss            MG            0x068..0x06B     12 (3, 7)
fast_data_access_protection          MG            0x06C..0x06F     12 (3, 8)
spill_n_normal (n = 0 .. 7)          AG            0x080..0x09F     9
spill_n_other (n = 0 .. 7)           AG            0x0A0..0x0BF     9
fill_n_normal (n = 0 .. 7)           AG            0x0C0..0x0DF     9
fill_n_other (n = 0 .. 7)            AG            0x0E0..0x0FF     9
trap_instruction                     AG            0x100..0x17F     16 (5)

Numbers in parentheses refer to the following notes:
1. Priority 1 traps are processed in the following order: XIR > WDR > SIR > RED.
2. Fp_exception_ieee_754 and fp_exception_other are mutually exclusive with memory access traps such as privileged_action and VA_watchpoint. Privileged_action has higher priority than VA_watchpoint.
3. Priority 12 traps are processed in the following program order: data_access_exception > fast_data_access_MMU_miss/fast_data_access_protection > PA_watchpoint > data_access_error.
4. Priority 10 traps are processed in the following order: LDDF/STDF_mem_address_not_aligned > mem_address_not_aligned trap. LDDF/STDF_mem_address_not_aligned traps are mutually exclusive.
5. Priority 16 traps are processed in the following order: trap instruction > interrupt_vector.
6. When an MMU fault is detected during an instruction access, a fast_instruction_access_MMU_miss trap is generated instead of an instruction_access_MMU_miss trap.
7. A fast_data_access_MMU_miss trap is generated instead of a data_access_MMU_miss trap.
8. A fast_data_access_protection trap is generated instead of a data_access_protection trap.
9. AG = alternate globals, MG = MMU globals, IG = interrupt globals.
10. Some ASIs must be used with specific types of loads and stores; for example, block ASIs can be used only with LDDFA/STDFA. When these ASIs are used with incorrect opcodes, they do not take mem_address_not_aligned or illegal_instruction traps for memory and register alignment required by the ASI. For example, block ASIs require 64-byte alignment, but an LDFA opcode with a block ASI checks only for 4-byte alignment.
CHAPTER
7
UltraSPARC-IIi Memory System
7.1
Overview
The UltraSPARC-IIi memory system is designed to provide performance comparable overall to existing UltraSPARC systems while using a narrower memory interface. Using EDO DRAMs achieves a CAS cycle half as long as that possible using FPM. Control signals are asserted on processor clock boundaries to allow precise control of DRAM signal transitions.

In addition to the addressing mode that supports DRAMs with 10-bit column addresses, a second mode supports 11-bit column addressing. Because the total number of address bits available in the memory controller is constant, at a maximum of 1 GB addressable, the maximum number of DIMM pairs is halved in 11-bit column address mode.

The connectivity of RASB_L/RAST_L is critical and non-intuitive given the JEDEC standard pin names for the DIMMs. Follow the schematics in FIGURE 7-1 and FIGURE 7-2 exactly. The B and T versions of RAS must go to the same DIMM, because there are no separate B and T versions of the refresh enable/disable bits for each DIMM. See Section 18.2, “Mem_Control0 Register (0x1FE.0000.F010)” on page 279.
[FIGURE 7-1: diagram of the UltraSPARC-IIi memory interface (MEMADDR[12:0], RASB_L[3:0], RAST_L[3:0], CAS_L[1:0], WE_L) wired to DIMM PAIR 0 (RASB_L[0], RAST_L[0]) and DIMM PAIR 2 (RASB_L[2], RAST_L[2]). Each DIMM has a 72-bit DATA connection; the XCVR interface DATA path is 144 bits wide.]

Two copies of CAS_L are provided only to reduce loading. Both are always asserted together. A real configuration needs buffers on RAS/CAS/WE. See the design guide for requirements on min/max delays and skew relationships.

FIGURE 7-1
Memory RAS Wiring with 10-bit Column, 8-128 MB DIMM
[FIGURE 7-2: diagram of the UltraSPARC-IIi memory interface (MEMADDR[12:0], RASB_L[3:0], RAST_L[3:0], CAS_L[1:0], WE_L) wired to DIMM PAIR 0 through DIMM PAIR 3, where pair n uses RASB_L[n] and RAST_L[n]. Each DIMM has a 72-bit DATA connection; the XCVR interface DATA path is 144 bits wide.]

Two copies of CAS_L are provided only to reduce loading. Both are always asserted together. A real configuration needs buffers on RAS/CAS/WE. See the design guide for requirements on min/max delays and skew relationships.

FIGURE 7-2
Memory RAS Wiring with 11-bit Column, 8-256MB DIMM
7.2
10-bit Column Addressing
[FIGURE 7-3: physical address field layout (uls, ds, ROW, COL) for each supported DIMM size: 8 MB (1M x 16 parts), 16 MB (2M x 8 parts), 32 MB (4M x 4 parts), 64 MB (4M x 4 banked or 8M x 8 parts) **, and 128 MB (8M x 8 banked parts). For the 8 MB, 16 MB, and 32 MB sizes, the uls position is shown as 0.]

uls = upper/lower bank select
ds = DIMM pair select
** uls used if banked, otherwise uls = 0 and msbs of the row address may or may not be 0.

FIGURE 7-3
UltraSPARC-IIi Memory Addressing for 10-bit Column Address Mode
In this scheme, PA[28:27] is used as a DIMM select; it selects a DIMM pair. PA[29] is used as an upper/lower bank select: 0 = bottom bank, 1 = top bank. DIMMs that contain only a single (bottom) bank must have PA[29] = 0 to be accessed. The mapping of PA[29:27] to RAS assertion is shown in TABLE 7-1.
TABLE 7-1 PA[29:27] to RASX_L Mapping for 10-bit Column Address Mode

PA[29:27]   RAS_L Asserted
000         RASB_L[0]
001         RASB_L[1]
010         RASB_L[2]
011         RASB_L[3]
100         RAST_L[0]
101         RAST_L[1]
110         RAST_L[2]
111         RAST_L[3]
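The PA[29:27] decode for 10-bit column address mode can be illustrated with a short sketch. This is only a software model of the mapping table, not a description of the memory controller logic; the function name `ras_line_10bit` is hypothetical.

```python
def ras_line_10bit(pa: int) -> str:
    """Model of the PA[29:27] to RAS mapping in 10-bit column address mode.

    Hypothetical helper for illustration; it only mirrors the mapping table.
    """
    bank = (pa >> 29) & 0x1        # PA[29]: 0 = bottom bank, 1 = top bank
    dimm_pair = (pa >> 27) & 0x3   # PA[28:27]: DIMM pair select
    prefix = "RAST_L" if bank else "RASB_L"
    return f"{prefix}[{dimm_pair}]"


if __name__ == "__main__":
    for pa in (0x0000_0000, 0x0800_0000, 0x1800_0000, 0x2000_0000):
        print(f"PA {pa:#011x} -> {ras_line_10bit(pa)}")
```

An address with PA[29] = 1 always selects a RAST_L line in this model, which is why single-bank DIMMs are reachable only with PA[29] = 0.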
TABLE 7-2 Memory Address Map for 10-bit Column Address Mode

DIMM Pair   Individual DIMM size   Address Range (PA[29:0])
0           8MB                    0x0000_0000 to 0x00FF_FFFF
0           16MB                   0x0000_0000 to 0x01FF_FFFF
0           32MB                   0x0000_0000 to 0x03FF_FFFF
0           64MB                   0x0000_0000 to 0x07FF_FFFF
0           64MB (banked)          0x0000_0000 to 0x03FF_FFFF and 0x2000_0000 to 0x23FF_FFFF
0           128MB (banked)         0x0000_0000 to 0x07FF_FFFF and 0x2000_0000 to 0x27FF_FFFF
1           8MB                    0x0800_0000 to 0x08FF_FFFF
1           16MB                   0x0800_0000 to 0x09FF_FFFF
1           32MB                   0x0800_0000 to 0x0BFF_FFFF
1           64MB                   0x0800_0000 to 0x0FFF_FFFF
1           64MB (banked)          0x0800_0000 to 0x0BFF_FFFF and 0x2800_0000 to 0x2BFF_FFFF
1           128MB (banked)         0x0800_0000 to 0x0FFF_FFFF and 0x2800_0000 to 0x2FFF_FFFF
2           8MB                    0x1000_0000 to 0x10FF_FFFF
2           16MB                   0x1000_0000 to 0x11FF_FFFF
2           32MB                   0x1000_0000 to 0x13FF_FFFF
2           64MB                   0x1000_0000 to 0x17FF_FFFF
2           64MB (banked)          0x1000_0000 to 0x13FF_FFFF and 0x3000_0000 to 0x33FF_FFFF
2           128MB (banked)         0x1000_0000 to 0x17FF_FFFF and 0x3000_0000 to 0x37FF_FFFF
3           8MB                    0x1800_0000 to 0x18FF_FFFF
3           16MB                   0x1800_0000 to 0x19FF_FFFF
3           32MB                   0x1800_0000 to 0x1BFF_FFFF
3           64MB                   0x1800_0000 to 0x1FFF_FFFF
3           64MB (banked)          0x1800_0000 to 0x1BFF_FFFF and 0x3800_0000 to 0x3BFF_FFFF
3           128MB (banked)         0x1800_0000 to 0x1FFF_FFFF and 0x3800_0000 to 0x3FFF_FFFF
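The rows of the 10-bit address map appear to follow a simple rule: a pair's base address comes from PA[28:27], an unbanked pair occupies twice the individual DIMM size starting at that base, and a banked pair splits the same capacity between the bottom bank (PA[29] = 0) and the top bank (PA[29] = 1). The sketch below encodes that inferred rule; the function and constant names are hypothetical, and the rule is an observation about the table rather than a statement from the manual.

```python
MB = 1024 * 1024
PAIR_STRIDE_10BIT = 0x0800_0000  # PA[28:27] selects the pair: pairs sit 128 MB apart
TOP_BANK_OFFSET = 0x2000_0000    # PA[29] = 1 selects the top bank


def pair_ranges_10bit(pair: int, dimm_mb: int, banked: bool):
    """Return the inclusive (start, end) PA ranges for one DIMM pair.

    Rule inferred from the address map table; illustrative only.
    """
    base = pair * PAIR_STRIDE_10BIT
    pair_bytes = 2 * dimm_mb * MB            # a pair holds two DIMMs
    if not banked:
        return [(base, base + pair_bytes - 1)]
    bank_bytes = pair_bytes // 2             # capacity split across two banks
    top = base + TOP_BANK_OFFSET
    return [(base, base + bank_bytes - 1), (top, top + bank_bytes - 1)]


if __name__ == "__main__":
    for start, end in pair_ranges_10bit(1, 64, banked=True):
        print(f"{start:#011x} to {end:#011x}")
```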
7.3
11-bit Column Addressing
[FIGURE 7-4: physical address field layout (uls, ds, ROW, COL) for each supported DIMM size: 8 MB (1M x 16 parts), 16 MB (2M x 8 parts), 32 MB (4M x 4 parts), 64 MB (4M x 4 banked or 8M x 8 parts) **, 128 MB (8M x 8 banked or 16M x 4 parts) **, and 256 MB (16M x 4 banked). For the 8 MB, 16 MB, and 32 MB sizes, the uls position is shown as 0.]

uls = upper/lower bank select
ds = DIMM pair select
** uls used if banked, otherwise uls = 0 and msbs of the row address may or may not be 0.

FIGURE 7-4
UltraSPARC-IIi Memory Addressing for 11-bit Column Address Mode
In this scheme, PA[28] is used as a DIMM select; it selects a DIMM pair. PA[29] is used as an upper/lower bank select: 0 = bottom bank, 1 = top bank. DIMMs that contain only a single (bottom) bank must have PA[29] = 0 in order to be accessed. The mapping of PA[29:28] to RASX_L is shown in TABLE 7-3.
TABLE 7-3 PA[29:28] to RASX_L Mapping for 11-bit Column Address Mode

PA[29:28]   RAS_L Asserted
00          RASB_L[0]
01          RASB_L[2]
10          RAST_L[0]
11          RAST_L[2]
TABLE 7-4 Memory Address Map for 11-bit Column Address Mode

DIMM Pair   Individual DIMM size   Address Range (PA[29:0])
0           8MB                    0x0000_0000 to 0x00FF_FFFF
0           16MB                   0x0000_0000 to 0x01FF_FFFF
0           32MB                   0x0000_0000 to 0x03FF_FFFF
0           64MB                   0x0000_0000 to 0x07FF_FFFF
0           64MB (banked)          0x0000_0000 to 0x03FF_FFFF and 0x2000_0000 to 0x23FF_FFFF
0           128MB                  0x0000_0000 to 0x0FFF_FFFF
0           128MB (banked)         0x0000_0000 to 0x07FF_FFFF and 0x2000_0000 to 0x27FF_FFFF
0           256MB (banked)         0x0000_0000 to 0x0FFF_FFFF and 0x2000_0000 to 0x2FFF_FFFF
2           8MB                    0x1000_0000 to 0x10FF_FFFF
2           16MB                   0x1000_0000 to 0x11FF_FFFF
2           32MB                   0x1000_0000 to 0x13FF_FFFF
2           64MB                   0x1000_0000 to 0x17FF_FFFF
2           64MB (banked)          0x1000_0000 to 0x13FF_FFFF and 0x3000_0000 to 0x33FF_FFFF
2           128MB                  0x1000_0000 to 0x1FFF_FFFF
2           128MB (banked)         0x1000_0000 to 0x17FF_FFFF and 0x3000_0000 to 0x37FF_FFFF
2           256MB (banked)         0x1000_0000 to 0x1FFF_FFFF and 0x3000_0000 to 0x3FFF_FFFF
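In 11-bit column address mode only PA[28] selects the DIMM pair, so just pairs 0 and 2 exist and each pair gets a 256 MB window per bank. The sketch below models the RAS decode and the address ranges for this mode. The names are hypothetical, and the range rule (two DIMMs per pair, capacity split evenly across banks) is inferred from the table rows rather than stated in the text.

```python
MB = 1024 * 1024


def ras_line_11bit(pa: int) -> str:
    """Model of the PA[29:28] to RAS mapping in 11-bit column address mode."""
    bank = (pa >> 29) & 0x1           # PA[29]: 0 = bottom bank, 1 = top bank
    pair = 2 * ((pa >> 28) & 0x1)     # PA[28]: selects DIMM pair 0 or 2
    prefix = "RAST_L" if bank else "RASB_L"
    return f"{prefix}[{pair}]"


def pair_ranges_11bit(pair: int, dimm_mb: int, banked: bool):
    """Inclusive (start, end) PA ranges for DIMM pair 0 or 2 (inferred rule)."""
    if pair not in (0, 2):
        raise ValueError("only DIMM pairs 0 and 2 exist in 11-bit column mode")
    base = (pair // 2) << 28          # pair 2 starts at 0x1000_0000
    pair_bytes = 2 * dimm_mb * MB
    if not banked:
        return [(base, base + pair_bytes - 1)]
    bank_bytes = pair_bytes // 2
    top = base | (1 << 29)            # top bank has PA[29] = 1
    return [(base, base + bank_bytes - 1), (top, top + bank_bytes - 1)]


if __name__ == "__main__":
    print(ras_line_11bit(0x3000_0000))
    print([(hex(s), hex(e)) for s, e in pair_ranges_11bit(0, 256, banked=True)])
```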
CHAPTER
8
Cache and Memory Interactions
8.1
Introduction
This chapter describes various interactions between the caches and memory, and the management processes that an operating system must perform to maintain data integrity in these cases. In particular, it discusses:
• Invalidation of one or more cache entries – when and how to do it
• Differences between cacheable and non-cacheable accesses
• Ordering and synchronization of memory accesses
• Accesses to addresses that cause side effects (I/O accesses)
• Non-faulting loads
• Instruction prefetching
• Load and store buffers
This chapter only addresses coherence in a uniprocessor environment. For more information about coherence in multi-processor environments, see Chapter 20, “SPARC-V9 Memory Models.”
8.2
Cache Flushing
Data in the level-1 (read-only or write-through) caches can be flushed by invalidating the entry in the cache. Modified data in the level-2 (writeback) cache, subsequently referred to as the External cache or E-cache, must be written back to memory when flushed.
Cache flushing is required in the following cases:
• I-cache: Flush is needed before executing code that is modified by a local store instruction other than block commit store; see Section 3.1.1.1, “Instruction Cache (I-cache).” This is done with the FLUSH instruction or using ASI accesses. See Section A.7, “I-cache Diagnostic Accesses” on page 387. When ASI accesses are used, software must ensure that the flush is done on the same processor as the stores that modified the code space.
• D-cache: Flush is needed when a physical page is changed from (virtually) cacheable to (virtually) noncacheable, or when an illegal address alias is created (see Section 8.2.1, “Address Aliasing Flushing” on page 68). This is done with a displacement flush (see Section 8.2.3, “Displacement Flushing” on page 69) or using ASI accesses. See Section A.8, “D-cache Diagnostic Accesses” on page 392.
• E-cache: Flush is needed for stable storage. Examples of stable storage include battery-backed memory and transaction logs. This is done with either a displacement flush (see Section 8.2.3, “Displacement Flushing” on page 69) or a store with ASI_BLK_COMMIT_{PRIMARY,SECONDARY}. Flushing the E-cache flushes the corresponding blocks from the I- and D-caches, because UltraSPARC-IIi maintains inclusion between the external and internal caches. See Section 8.2.2, “Committing Block Store Flushing” on page 69.
8.2.1
Address Aliasing Flushing
A side effect inherent in a virtually indexed cache is illegal address aliasing. Aliasing occurs when multiple virtual addresses map to the same physical address. Because the UltraSPARC-IIi D-cache is indexed with virtual address bits and is larger than the minimum page size, the different aliased virtual addresses can end up in different cache blocks. Such aliases are illegal because updates to one cache block are not reflected in the aliased cache blocks.

Normally, software avoids illegal aliasing by forcing aliases to have the same address bits (virtual color) up to an alias boundary. For UltraSPARC-IIi, the minimum alias boundary is 16 KB; this size may increase in future designs. When the alias boundary is violated, software must flush the D-cache if the page was virtually cacheable. In this case, only one mapping of the physical page can be allowed in the D-MMU at a time.

Alternatively, software can turn off virtual caching of illegally aliased pages. This allows multiple mappings of the alias to be in the D-MMU and avoids flushing the D-cache each time a different mapping is referenced.
Note – A change in virtual color when allocating a free page does not require a
D-cache flush, because the D-cache is write-through.
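The virtual color check that software performs can be sketched as follows. The 16 KB alias boundary comes from the text; the 8 KB minimum page size is an assumption not stated in this section, and the function names are hypothetical.

```python
PAGE_SIZE = 8 * 1024        # ASSUMPTION: minimum page size, not given in this section
ALIAS_BOUNDARY = 16 * 1024  # minimum alias boundary stated in the text


def virtual_color(va: int) -> int:
    """Return the virtual color: the page-number bits below the alias boundary."""
    return (va % ALIAS_BOUNDARY) // PAGE_SIZE


def is_legal_alias(va1: int, va2: int) -> bool:
    """Two mappings of one physical page are legal aliases if their colors match."""
    return virtual_color(va1) == virtual_color(va2)


if __name__ == "__main__":
    print(is_legal_alias(0x10000, 0x24000))  # same color
    print(is_legal_alias(0x10000, 0x12000))  # different color
```

With these assumed sizes the color is a single bit (VA[13]); a larger alias boundary in a future design would widen it, which is why the boundary is kept as a named constant.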
8.2.2
Committing Block Store Flushing
In UltraSPARC-IIi, stable storage must be implemented by a software cache flush. Data that is present and modified in the E-cache must be written back to the stable storage. UltraSPARC-IIi implements two ASIs, ASI_BLK_COMMIT_{PRIMARY,SECONDARY}, to perform these writebacks efficiently when software can ensure exclusive write access to the block being flushed. Using these ASIs, software can write back data from the floating-point registers to memory and invalidate the entry in the cache. The data in the floating-point registers must first be loaded by a block load instruction. A MEMBAR #Sync instruction is needed to ensure that the flush is complete. See also Section 13.5.3, “Block Load and Store Instructions” on page 172.
8.2.3
Displacement Flushing
Cache flushing also can be accomplished by a displacement flush. This is done by reading a range of read-only addresses that map to the corresponding cache line being flushed, forcing out modified entries in the local cache. Care must be taken to ensure that the range of read-only addresses is mapped in the MMU before starting a displacement flush, otherwise the TLB miss handler may put new data into the caches.
Note – Diagnostic ASI accesses to the E-cache can be used to invalidate a line, but
they are generally not an alternative to displacement flushing. Modified data in the E-cache will not be written back to memory using these ASI accesses. See Section A.9, “E-cache Diagnostics Accesses” on page 394.
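The address arithmetic behind a displacement flush can be sketched as below. The sketch assumes a direct-mapped, physically indexed E-cache with 64-byte lines and a flush region aligned to the cache size; the 512 KB cache size, the region base address, and the function names are hypothetical values chosen for illustration.

```python
LINE_SIZE = 64  # ASSUMPTION: E-cache line size in bytes


def displacement_address(target_pa: int, flush_base: int, cache_size: int) -> int:
    """Address in the read-only flush region that maps to the same cache line.

    Assumes a direct-mapped cache and flush_base aligned to cache_size.
    """
    offset = target_pa % cache_size       # index bits of the target address
    offset -= offset % LINE_SIZE          # round down to the start of the line
    return flush_base + offset


def full_flush_reads(flush_base: int, cache_size: int) -> range:
    """Addresses to read (one per line) to displace the entire cache."""
    return range(flush_base, flush_base + cache_size, LINE_SIZE)


if __name__ == "__main__":
    size = 512 * 1024                     # hypothetical E-cache size
    base = 0x4000_0000                    # hypothetical mapped read-only region
    print(hex(displacement_address(0x0012_3456, base, size)))
    print(len(full_flush_reads(base, size)))
```

Reading the computed address displaces whatever line currently occupies that index, which forces a modified line to be written back. The flush region must already be mapped in the MMU, as the text requires.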
8.3
Memory Accesses and Cacheability
Note – Atomic load-store instructions are treated as both a load and a store; they
can be performed only in cacheable address spaces.
8.3.1
Coherence Domains
Two types of memory operations are supported in UltraSPARC-IIi: cacheable and noncacheable accesses, as indicated by the page translation. Cacheable accesses are inside the coherence domain; noncacheable accesses are outside the coherence domain. SPARC-V9 does not specify memory ordering between cacheable and noncacheable accesses. In TSO mode, UltraSPARC-IIi maintains TSO ordering, regardless of the cacheability of the accesses. For SPARC-V9 compatibility while in PSO or RMO mode, a MEMBAR #Lookaside should be used between a store and a subsequent load to the same noncacheable address. See The SPARC Architecture Manual, Version 9 for more information about the SPARC-V9 memory models.
Note – On UltraSPARC-IIi, a MEMBAR #Lookaside executes more efficiently than
a MEMBAR #StoreLoad.
8.3.1.1
Cacheable Accesses
Accesses that fall within the coherence domain are called cacheable accesses. They are implemented in UltraSPARC-IIi with the following properties:
• Data resides in real memory locations.
• They observe supported cache coherence protocol.
• The unit of coherence is 64 bytes.
8.3.1.2
Non-Cacheable and Side-Effect Accesses
Accesses that are outside the coherence domain are called noncacheable accesses. Accesses of some of these memory (or memory mapped) locations may result in side-effects. Noncacheable accesses are implemented in UltraSPARC-IIi with the following properties:
• Data may or may not reside in real memory locations.
• Accesses may result in program-visible side-effects; for example, memory-mapped I/O control registers in a UART may change state when read.
• Accesses may not observe supported cache coherence protocol.
• The smallest unit in each transaction is a single byte.
Noncacheable accesses with the E-bit set (that is, those having side-effects) are all strongly ordered with respect to other noncacheable accesses with the E-bit set. In addition, store buffer compression is disabled for these accesses. Speculative loads with the E-bit set cause a data_access_exception trap (with SFSR.FT=2, speculative load to page marked with E-bit).
Note – The side-effect attribute does not imply noncacheability.
8.3.1.3
Global Visibility and Memory Ordering
To ensure the correct ordering between the cacheable and noncacheable domains, explicit memory synchronization is needed in the form of MEMBARs or atomic instructions. CODE EXAMPLE 8-1 illustrates the issues involved in mixing cacheable and noncacheable accesses.
CODE EXAMPLE 8-1 Memory Ordering and MEMBAR Examples

Assume that all accesses go to non-side-effect memory locations.

Process A:
While (1)
{
        Store D1: data produced
    #1  MEMBAR #StoreStore (needed in PSO, RMO)
        Store F1: set flag
        While F1 is set (spin on flag)
            Load F1
    #2  MEMBAR #LoadLoad | #LoadStore (needed in RMO)
        Load D2
}

Process B:
While (1)
{
        While F1 is cleared (spin on flag)
            Load F1
    #2  MEMBAR #LoadLoad | #LoadStore (needed in RMO)
        Load D1
        Store D2
    #1  MEMBAR #StoreStore (needed in PSO, RMO)
        Store F1: clear flag
}
Note – A MEMBAR #MemIssue or MEMBAR #Sync is needed if ordering of cacheable accesses following noncacheable accesses must be maintained in PSO or RMO.

Due to load and store buffers implemented in UltraSPARC-IIi, CODE EXAMPLE 8-1 may not work in PSO and RMO modes without the MEMBARs shown in the program segment.

In TSO mode, loads and stores (except block stores) cannot pass earlier loads, and stores cannot pass earlier stores; therefore, no MEMBAR is needed.

In PSO mode, loads are completed in program order, but stores are allowed to pass earlier stores; therefore, only the MEMBAR at #1 is needed between updating data and the flag.

In RMO mode, there is no implicit ordering between memory accesses; therefore, the MEMBARs at both #1 and #2 are needed.
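The per-mode requirements described above can be summarized in a small lookup. This is only a restatement of the text as data; the dictionary and function names are hypothetical.

```python
# Which of the MEMBARs labeled #1 and #2 in CODE EXAMPLE 8-1 each mode needs.
REQUIRED_MEMBARS = {
    "TSO": set(),           # loads and stores stay ordered: no MEMBAR needed
    "PSO": {"#1"},          # stores may pass earlier stores
    "RMO": {"#1", "#2"},    # no implicit ordering between accesses
}

# Operands of the two labeled MEMBARs, as written in the example.
MEMBAR_OPERANDS = {
    "#1": "#StoreStore",
    "#2": "#LoadLoad | #LoadStore",
}


def membars_needed(mode: str) -> list:
    """Return the MEMBAR operands the example needs in the given memory model."""
    return [MEMBAR_OPERANDS[label] for label in sorted(REQUIRED_MEMBARS[mode])]


if __name__ == "__main__":
    for mode in ("TSO", "PSO", "RMO"):
        print(mode, membars_needed(mode))
```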
8.3.2
Memory Synchronization: MEMBAR and FLUSH
The MEMBAR (STBAR in SPARC-V8) and FLUSH instructions are provided for explicit control of memory ordering in program execution. MEMBAR has several variations; their implementations in UltraSPARC-IIi are described below. See the references to “Memory Barrier,” “The MEMBAR Instruction,” and “Programming With the Memory Models,” in The SPARC Architecture Manual, Version 9 for more information.
8.3.2.1
MEMBAR #LoadLoad
Forces all loads after the MEMBAR to wait until all loads before the MEMBAR have reached global visibility.
8.3.2.2
MEMBAR #StoreLoad
Forces all loads after the MEMBAR to wait until all stores before the MEMBAR have reached global visibility.
8.3.2.3
MEMBAR #LoadStore
Forces all stores after the MEMBAR to wait until all loads before the MEMBAR have reached global visibility.
8.3.2.4
MEMBAR #StoreStore and STBAR
Forces all stores after the MEMBAR to wait until all stores before the MEMBAR have reached global visibility.
Note – STBAR has the same semantics as MEMBAR #StoreStore; it is included
for SPARC-V8 compatibility.
Note – The above four MEMBARs do not guarantee ordering between cacheable
accesses after noncacheable accesses.
8.3.2.5
MEMBAR #Lookaside
SPARC-V9 provides this variation for implementations having virtually tagged store buffers that do not contain information for snooping.
Note – For SPARC-V9 compatibility, this variation should be used before issuing a
load to an address space that cannot be snooped.
8.3.2.6
MEMBAR #MemIssue
Forces all outstanding memory accesses to be completed before any memory access instruction after the MEMBAR is issued. It must be used to guarantee ordering of cacheable accesses following non-cacheable accesses. For example, I/O accesses must be followed by a MEMBAR #MemIssue before subsequent cacheable stores; this ensures that the I/O accesses reach global visibility before the cacheable stores after the MEMBAR.
Note – MEMBAR #MemIssue is different from the combination of MEMBAR
#LoadLoad | #LoadStore | #StoreLoad | #StoreStore. MEMBAR #MemIssue orders cacheable and noncacheable domains; it prevents memory accesses after it from issuing until it completes.
8.3.2.7
MEMBAR #Sync (Issue Barrier)
Forces all outstanding instructions and all deferred errors to be completed before any instructions after the MEMBAR are issued.
Note – MEMBAR #Sync is a costly instruction; unnecessary usage may result in
substantial performance degradation.
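The MEMBAR variations above are combined in a single 7-bit mask operand. The sketch below shows how such a mask could be composed. The bit values follow the mmask/cmask encoding given in The SPARC Architecture Manual, Version 9, not this section, and the helper name `membar_mask` is hypothetical.

```python
# Bit assignments per the SPARC-V9 MEMBAR definition (mmask bits 3:0, cmask bits 6:4).
MEMBAR_BITS = {
    "#LoadLoad":   0x01,
    "#StoreLoad":  0x02,
    "#LoadStore":  0x04,
    "#StoreStore": 0x08,
    "#Lookaside":  0x10,
    "#MemIssue":   0x20,
    "#Sync":       0x40,
}


def membar_mask(*names: str) -> int:
    """OR together the mask bits for the named MEMBAR variations."""
    mask = 0
    for name in names:
        mask |= MEMBAR_BITS[name]   # raises KeyError for an unknown variation
    return mask


if __name__ == "__main__":
    print(hex(membar_mask("#LoadLoad", "#LoadStore")))
    print(hex(membar_mask("#Sync")))
```

Note that combining all four ordering bits yields 0x0F, which sets none of the cmask bits; this matches the statement that MEMBAR #MemIssue is different from the combination of the four ordering variations.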
8.3.2.8
Self-Modifying Code (FLUSH)
The SPARC-V9 instruction set architecture does not guarantee consistency between code and data spaces. A problem arises when code space is dynamically modified by a program writing to memory locations containing instructions. LISP programs and dynamic linking require this behavior. SPARC-V9 provides the FLUSH instruction to synchronize instruction and data memory after code space has been modified. In UltraSPARC-IIi, a FLUSH behaves like a store instruction for the purpose of memory ordering. In addition, all instruction fetch (or prefetch) buffers are invalidated. The issue of the FLUSH instruction is delayed until previous (cacheable) stores are completed. Instruction fetch (or prefetch) resumes at the instruction immediately after the FLUSH.
8.3.3
Atomic Operations
SPARC-V9 provides three atomic instructions to support mutual exclusion. These instructions behave like both a load and a store, but the operations are carried out indivisibly. Atomic instructions may be used only in the cacheable domain. An atomic access with a restricted ASI in unprivileged mode (PSTATE.PRIV=0) causes a privileged_action trap. An atomic access with a noncacheable address causes a data_access_exception trap (with SFSR.FT=4, atomic to page marked non-cacheable). An atomic access with an unsupported ASI causes a data_access_exception trap (with SFSR.FT=8, illegal ASI value or virtual address). TABLE 8-1 lists the ASIs that support atomic accesses.
TABLE 8-1 ASIs that Support SWAP, LDSTUB, and CAS

ASI Name                             Access
ASI_NUCLEUS{_LITTLE}                 Restricted
ASI_AS_IF_USER_PRIMARY{_LITTLE}      Restricted
ASI_AS_IF_USER_SECONDARY{_LITTLE}    Restricted
ASI_PRIMARY{_LITTLE}                 Unrestricted
ASI_SECONDARY{_LITTLE}               Unrestricted
ASI_PHYS_USE_EC{_LITTLE}             Unrestricted
Note – Atomic accesses with non-faulting ASIs are not allowed, because these ASIs
have the load-only attribute.
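The trap conditions for atomic accesses can be modeled as a small decision function. This is a sketch under stated assumptions: the order in which the two data_access_exception causes are checked is an assumption (the text gives only the individual conditions), restricted-ness is known only for the ASIs in TABLE 8-1, and the function name is hypothetical. Checking privileged_action first follows its higher priority (11 versus 12) in TABLE 6-12.

```python
RESTRICTED_ASIS = {"ASI_NUCLEUS", "ASI_AS_IF_USER_PRIMARY", "ASI_AS_IF_USER_SECONDARY"}
UNRESTRICTED_ASIS = {"ASI_PRIMARY", "ASI_SECONDARY", "ASI_PHYS_USE_EC"}


def atomic_access_trap(asi_name: str, privileged: bool, cacheable: bool):
    """Return (trap_name, SFSR.FT) for an atomic access, or None if it proceeds."""
    suffix = "_LITTLE"
    base = asi_name[:-len(suffix)] if asi_name.endswith(suffix) else asi_name
    if base in RESTRICTED_ASIS and not privileged:
        return ("privileged_action", None)
    if base not in RESTRICTED_ASIS and base not in UNRESTRICTED_ASIS:
        return ("data_access_exception", 0x8)   # unsupported ASI
    if not cacheable:
        return ("data_access_exception", 0x4)   # atomic to noncacheable page
    return None


if __name__ == "__main__":
    print(atomic_access_trap("ASI_NUCLEUS", privileged=False, cacheable=True))
    print(atomic_access_trap("ASI_PRIMARY_LITTLE", privileged=False, cacheable=False))
```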
8.3.3.1
SWAP Instruction
SWAP atomically exchanges the lower 32 bits in an integer register with a word in memory. This instruction is issued only after store buffers are empty. Subsequent loads interlock on earlier SWAPs. A cache miss allocates the corresponding line.
Note – If a page is marked as virtually-non-cacheable but physically cacheable,
allocation is done to the E-cache only.
8.3.3.2
LDSTUB Instruction
LDSTUB behaves like SWAP, except that it loads a byte from memory into an integer register and atomically writes all ones (0xFF) into the addressed byte.
8.3.3.3
Compare and Swap (CASX) Instruction
Compare-and-swap combines a load, compare, and store into a single atomic instruction. It compares the value in an integer register to a value in memory; if they are equal, the value in memory is swapped with the contents of a second integer register. All of these operations are carried out atomically; in other words, no other memory operation may be applied to the addressed memory location until the entire compare-and-swap sequence is completed.
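The semantics of the three atomic instructions can be modeled in a few lines. This is a single-threaded software model, so each function call stands in for one indivisible hardware operation; the dictionary-based memory and all function names are hypothetical. Returning the old memory value from compare-and-swap in both the equal and unequal cases follows the SPARC-V9 definition rather than this section.

```python
MASK32 = 0xFFFF_FFFF


def swap(memory: dict, addr: int, reg_value: int) -> int:
    """SWAP: exchange the lower 32 bits of a register with a word in memory."""
    old = memory[addr]
    memory[addr] = reg_value & MASK32
    return old


def ldstub(memory: dict, addr: int) -> int:
    """LDSTUB: load a byte and write all ones (0xFF) to the addressed byte."""
    old = memory[addr]
    memory[addr] = 0xFF
    return old


def compare_and_swap(memory: dict, addr: int, expected: int, new_value: int) -> int:
    """CAS: store new_value only if memory holds expected; return the old value."""
    old = memory[addr]
    if old == expected:
        memory[addr] = new_value
    return old


def try_lock(memory: dict, lock_addr: int) -> bool:
    """Illustrative spin-lock attempt: the lock is acquired if the byte was 0."""
    return ldstub(memory, lock_addr) == 0


if __name__ == "__main__":
    mem = {0x100: 0, 0x200: 7}
    print(try_lock(mem, 0x100), try_lock(mem, 0x100))
    print(compare_and_swap(mem, 0x200, 7, 42), mem[0x200])
```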
8.3.4
Non-Faulting Load
A non-faulting load behaves like a normal load, except that:
• It does not allow side-effect access. An access with the E-bit set causes a data_access_exception trap (with SFSR.FT=2, Speculative Load to page marked E-bit).
• It can be applied to a page with the NFO-bit set; other types of accesses will cause a data_access_exception trap (with SFSR.FT=0x10, Normal access to page marked NFO).
Non-faulting loads are issued with ASI_PRIMARY_NO_FAULT{_LITTLE}, or ASI_SECONDARY_NO_FAULT{_LITTLE}. A store with a NO_FAULT ASI causes a data_access_exception trap (with SFSR.FT=8, Illegal RW).

When a non-faulting load encounters a TLB miss, the operating system should attempt to translate the page. If the translation results in an error (for example, address out of range), a 0 is returned and the load completes silently.

Typically, optimizers use non-faulting loads to move loads before conditional control structures that guard their use. This technique potentially increases the distance between a load of data and the first use of that data, to hide latency; it allows for more flexibility in code scheduling. It also allows for improved performance in certain algorithms by removing address checking from the critical code path. For example, when following a linked list, non-faulting loads allow the null pointer to be accessed safely in a read-ahead fashion if the OS can ensure that the page at virtual address 0x0 is accessed with no penalty.

The NFO (non-fault access only) bit in the MMU marks pages that are mapped for safe access by non-faulting loads, but can still cause a trap by other, normal accesses. This allows programmers to trap on wild pointer references (many programmers count on an exception being generated when accessing address 0x0 to debug code) while benefiting from the acceleration of non-faulting access in debugged library routines.
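The interaction between access type and the E and NFO page attributes can be sketched as a small decision function. It is an illustrative model of the rules stated in this section only; the access-kind strings and the function name are hypothetical, and the check order when several conditions apply at once is an assumption.

```python
def nf_access_trap(kind: str, e_bit: bool = False, nfo_bit: bool = False):
    """Return (trap_name, SFSR.FT) or None for an access to a mapped page.

    kind is one of "load", "store", "nf_load" (load with a NO_FAULT ASI),
    or "nf_store" (store with a NO_FAULT ASI).
    """
    if kind not in ("load", "store", "nf_load", "nf_store"):
        raise ValueError(f"unknown access kind: {kind}")
    if kind == "nf_store":
        return ("data_access_exception", 0x08)   # store with a NO_FAULT ASI
    if kind == "nf_load":
        if e_bit:
            return ("data_access_exception", 0x02)   # side-effect page
        return None                                   # NFO pages are allowed
    if nfo_bit:
        return ("data_access_exception", 0x10)   # normal access to NFO page
    return None


if __name__ == "__main__":
    print(nf_access_trap("nf_load", nfo_bit=True))   # allowed
    print(nf_access_trap("load", nfo_bit=True))      # traps
```

The last two calls show the debugging use described above: a non-faulting load of an NFO page succeeds, while a normal load through a wild pointer to the same page still traps.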
8.3.5
PREFETCH Instructions
UltraSPARC-IIi has extensions to support the v9 Prefetch instruction. These extensions primarily address floating-point vector code, in which the software (compiler) can accurately schedule the prefetch of data sufficiently ahead of its usage, and in which execution is bounded by (E-cache) miss throughput. UltraSPARC-IIi allows loads and stores (E-cache-hits) to continue while a prefetch (E-cache-miss) is outstanding. An outstanding Prefetch does not block subsequent load or store hits. This extension from UltraSPARC allows greater miss throughput. The UltraSPARC Load Buffer is designed such that a load with an E-cache-miss blocks subsequent load hits; these load-hits in turn block subsequent load misses. This tends to serialize load-misses.
However, Prefetch misses do not block subsequent load hits. Hence prefetches can be scheduled sufficiently far in advance of the associated Load (or Store) instruction, without interfering with subsequent loads and stores. Prefetches appear as Loads that do not return data to a register. A prefetch request that is sent to the ECU checks the E-cache for the block. If the Prefetch hits in the E-cache, the operation will be complete; if it does not hit, the ECU requests that block from the Memory Control Unit (MCU). When the MCU returns the requested data, it is only written into the E-cache, not into the D-cache.
8.3.5.1
PREFETCH Behavior and Limitations
• All PREFETCH instructions are enqueued on the load buffer, except as noted below.
• Some conditions, noted below, cause an otherwise supported PREFETCH to be treated as a no-op and removed from the load buffer when it reaches the front of the queue.
• No PREFETCH will cause a trap except:
    • PREFETCH with fcn=5 .. 15 causes an illegal_instruction trap, as defined in The SPARC Architecture Manual, Version 9.
    • Watchpoint, as defined in Section A.5, “Watchpoint Support” on page 382.
• Any PREFETCHA that specifies an internal ASI in the following ranges is not enqueued on the load buffer and is not executed:
    • 0x40..0x4F, 0x50..0x5F, 0x60..0x6F, 0x76, 0x77
• The following conditions cause a PREFETCH{A} to be treated as a NOP:
    • PREFETCH with fcn=16..31, as defined in The SPARC Architecture Manual, Version 9.
    • A data_access_MMU_miss exception
    • D-MMU disabled
    • For PREFETCHA, any ASI other than the following: 0x04, 0x0C, 0x10, 0x11, 0x18, 0x19, 0x80..0x83, 0x88..0x8B
    • Attempt to PREFETCH to a noncacheable page
    • fcn==16..31
• Alignment is not checked on PREFETCH{A}. The 5 least significant address bits are ignored.
8.3.5.2
Implemented fcn Values
TABLE 8-2 lists the supported values for fcn and their meanings.
TABLE 8-2 PREFETCH{A} Variants

fcn     Prefetch function             Action
0       Prefetch for several reads    Generate DRAM read if the desired line is not E-cache-resident
1       Prefetch for one read         Generate DRAM read if the desired line is not E-cache-resident
4       Prefetch page                 Generate DRAM read if the desired line is not E-cache-resident
2       Prefetch for several writes   Generate DRAM read if the desired line is not E-cache-resident
3       Prefetch for one write        Generate DRAM read if the desired line is not E-cache-resident
5-15    reserved                      illegal-instruction trap
16-31   Implementation-dependent      no-op
For more information, including an enumeration of the bus transaction that each fcn value causes, see Section 14.4.5, “PREFETCH{A} (Impdep #103, 117)” on page 197.
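The fcn handling and the alignment rule can be summarized in a short sketch. It models only the fcn decode and the address masking described in this section; the other NOP conditions (MMU miss, D-MMU disabled, ASI restrictions, noncacheable page) are not modeled, and the function names and return strings are hypothetical.

```python
def prefetch_action(fcn: int) -> str:
    """Classify a PREFETCH{A} by its 5-bit fcn field."""
    if not 0 <= fcn <= 31:
        raise ValueError("fcn is a 5-bit field (0..31)")
    if fcn <= 4:
        return "dram_read_if_not_in_ecache"   # fcn 0..4: supported variants
    if fcn <= 15:
        return "illegal_instruction"          # fcn 5..15: reserved
    return "nop"                              # fcn 16..31: implementation-dependent


def prefetch_effective_address(addr: int) -> int:
    """Alignment is not checked: the 5 least significant address bits are ignored."""
    return addr & ~0x1F


if __name__ == "__main__":
    for fcn in (0, 4, 5, 16):
        print(fcn, prefetch_action(fcn))
    print(hex(prefetch_effective_address(0x1234_567F)))
```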
8.3.6
Block Loads and Stores
Block load and store instructions work like normal floating-point load and store instructions, except that the data size (granularity) is 64 bytes per transfer. See Section 13.5.3, “Block Load and Store Instructions” on page 172 for a full description of the instructions.
8.3.7
I/O (PCI or UPA64S) and Accesses with Side-effects
I/O locations may not behave with memory semantics. Loads and stores may have side-effects; for example, a read access may clear a register or pop an entry off a FIFO. A write access may set a register address port so that the next access to that address will read or write a particular internal register. Such devices are considered order sensitive. Also, such devices may only allow accesses of a fixed size, so store buffer merging of adjacent stores or stores within a 16-byte region will cause an access error.

The UltraSPARC-IIi MMU includes an attribute bit (the E-bit) in each page translation, which, when set, indicates that accesses to this page cause side effects. Accesses other than block loads or stores to pages that have this bit set have the following behavior:
• Noncacheable accesses are strongly ordered with respect to each other.
• Noncacheable loads with the E-bit set will not be issued until all previous control transfers (including exceptions) are resolved.
• Store buffer compression is disabled for noncacheable accesses.
• Non-faulting loads are not allowed and will cause a data_access_exception trap (with SFSR.FT = 2, speculative load to page marked E-bit).
• A MEMBAR may be needed between side-effect and non-side-effect accesses while in PSO and RMO modes.
8.3.8
Instruction Prefetch to Side-Effect Locations
UltraSPARC-IIi does instruction prefetching and follows branches that it predicts will be taken. Addresses mapped by the I-MMU may be accessed even though they are not actually executed by the program. Normally, locations with side effects or those that generate time-outs or bus errors will not be mapped by the I-MMU, so prefetching will not cause problems. When running with the I-MMU disabled, however, software must avoid placing data in the path of a control transfer instruction target or sequentially following a trap or conditional branch instruction. Data can be placed sequentially following the delay slot of a BA(,pt), CALL, or JMPL instruction. Instructions should not be placed within 256 bytes of locations with side effects. See Section 21.2.10, “Return Address Stack (RAS)” on page 349 for other information about JMPLs and RETURNs.
8.3.9
Instruction Prefetch When Exiting RED_state
Exiting RED_state by writing 0 to PSTATE.RED in the delay slot of a JMPL is not recommended. A noncacheable instruction prefetch may be made to the JMPL target, which may be in a cacheable memory area. This may result in a bus error on some systems, which will cause an instruction_access_error trap. The trap can be masked by setting the NCEEN bit in the ESTATE_ERR_EN register to zero, but this will mask all non-correctable error checking. To avoid this problem, exit RED_state with DONE or RETRY, or with a JMPL to a noncacheable target address.
8.3.10
UltraSPARC-IIi Internal ASIs
ASIs in the ranges 0x46..0x6F and 0x76..0x7F are used for accessing internal UltraSPARC-IIi states. Stores to these ASIs do not follow the normal memory model ordering rules. Correct operation requires the following:
Chapter 8
Cache and Memory Interactions
• A MEMBAR #Sync is needed after an internal ASI store other than MMU ASIs before the point that side effects must be visible. This MEMBAR must precede the next load or noninternal store. The MEMBAR also must be in or before the delay slot of a delayed control transfer instruction of any type. This is necessary to avoid corrupting data.
• A FLUSH, DONE, or RETRY is needed after an internal store to the MMU ASIs (ASI 0x50..0x52, 0x54..0x5F) or to the IC bit in the LSU control register before the point that side effects must be visible. Stores to D-MMU registers other than the context ASIs may also use a MEMBAR #Sync. One of these instructions must precede the next load or noninternal store. They also must be in or before the delay slot of a delayed control transfer instruction. This is necessary to avoid corrupting data.
8.4
Load Buffer
The load buffer allows the load and execution pipelines in UltraSPARC-IIi to be decoupled; thus, loads that cannot return data immediately will not stall the pipeline but, rather, will be buffered until they can return data. For example, when a load misses the on-chip D-cache and must access the E-cache, the load will be placed in the load buffer and the execution pipelines will continue moving as long as they do not require the register that is being loaded. An instruction that attempts to use the data that is being loaded by an instruction in the load buffer is called a ‘use’ instruction. The pipelines are not fully decoupled, because UltraSPARC-IIi still supports the notion of precise traps, and loads that are younger than a trapping instruction must not execute, except in the case of deferred traps. Loads themselves can take precise traps, when exceptions are detected in the pipeline. For example, address misalignment or access violations detected in the translation process will both be reported as precise traps. However, when a load has a hardware problem on the external bus (for example, a parity error), it will generate a deferred trap since younger instructions, unblocked by the D-cache miss, could have been retired and modified the machine state. This may result in termination of the user thread or reset. UltraSPARC-IIi does not support recovery from such hardware errors, and they are fatal. See Chapter 16, “Error Handling.”
8.5
Store Buffer
All store operations (including atomic and STA instructions) and barriers or store completion instructions (MEMBAR and STBAR) are entered into the Store Buffer.
8.5.1
Stores Delayed by Loads
The store buffer normally has lower priority than the load buffer when arbitrating for the D-cache or E-cache, since returning load data is usually more critical than store completion. To ensure that stores complete in a finite amount of time as required by SPARC-V9, UltraSPARC-IIi eventually will raise the store buffer priority above load buffer priority if the store buffer is continually locked out by subsequent loads (other than internal ASI loads). When software signals another processor with a store and then waits for that processor's reply in a load spin loop, the signaling store can remain in the store buffer until this timeout expires. For this type of code, it is more efficient to put a MEMBAR #StoreLoad between the store and the load spin loop.
8.5.2
Store Buffer Compression
Consecutive non-side-effect stores may be combined into aligned 8-byte entries in the store buffer to improve store bandwidth. Cacheable stores can only be compressed with adjacent cacheable stores. Likewise, noncacheable stores can only be compressed with adjacent noncacheable stores. In order to maintain strong ordering for I/O accesses, stores with the side-effect attribute (E-bit set) cannot be combined with any other stores. The memory control unit can also compress consecutive 8-byte stores into single 16-byte UPA64S transactions.
CHAPTER
9
PCI Bus Interface
9.1
Introduction
This chapter describes the PCI Bus Interface Module (PBM) of UltraSPARC-IIi. The PBM is a 0-66 MHz 32-bit host-PCI bridge. The Advanced PCI Bridge (APB) provides an external connection to two 32-bit 0-33 MHz PCI busses. APB forwards transactions in both directions between these primary and secondary PCI busses.
Main features:
• Operates with a 2x PCI clock (40-132 MHz)
• Single 64-byte DMA read/write buffers, single 64-byte PIO read/write buffer
• Little-endian to the bus and internal configuration space
9.1.1
Supported PCI features:
• 64-bit Addressing (Dual Address Cycle) for DMA bypass
• Required adapter and host-bridge configuration space header registers
• Fast Back-to-Back cycles as a DMA target
• Arbitrary byte enables (Consistent DMA)
• Optional external arbiter
• Ability to generate memory, I/O, and configuration read and write cycles
• Ability to generate special cycles
• Ability to receive memory cycles
• Peer-to-peer DMA on a single segment
9.1.2
Unsupported PCI features:
• Exclusive Access to main memory (LOCK)
• Peer-to-peer transfers between bus segments
• Cache support
• Cache-line Wrap Addressing Mode
• Fast Back-to-Back cycles as a PIO master
• Address/Data Stepping
• Subtractive decode
• Any DOS compatibility features
9.2
PCI Bus Operations
9.2.1
Basic Read/Write Cycles
Read and write transactions occur as specified in the PCI specification. When a DMA burst transfer goes over a line (64 B) boundary, UltraSPARC-IIi generates a disconnect. This disconnect normally causes the master device to reattempt the transaction at the address of the next untransferred data. UltraSPARC-IIi is capable of generating arbitrary byte enables on PIO writes. It can also generate aligned PIO reads of 1, 2, 4, 8, 16, and 64 bytes. A target device is required to drive all data bytes on reads, but is not required to support arbitrary byte enables on writes and may terminate the cycle with a target-abort if an illegal byte enable combination is signalled. UltraSPARC-IIi supports arbitrary byte enables for all DMA transactions. The PBM can accept Dual-Address-Cycles, using the 64-bit address in bypass mode. UltraSPARC-IIi does not generate 64-bit PIO cycles or PIOs with DACs.
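The effect of the line-boundary disconnect on a long DMA burst can be illustrated with a small software model. The sketch below is not a description of the hardware; the function name split_at_line_boundaries is hypothetical, and it assumes a master that always resumes at the next untransferred byte, as described above.

```python
LINE_BYTES = 64  # line size given in the text above


def split_at_line_boundaries(start, length):
    """Return (address, byte_count) pieces, none of which crosses a 64-byte line.

    Models a master that is disconnected at each line boundary and
    re-attempts the transfer at the next untransferred address.
    """
    pieces = []
    addr = start
    remaining = length
    while remaining > 0:
        # Bytes left before the next 64-byte boundary.
        room = LINE_BYTES - (addr % LINE_BYTES)
        count = min(room, remaining)
        pieces.append((addr, count))
        addr += count
        remaining -= count
    return pieces
```

For a 100-byte burst starting at 0x30, the model yields three pieces, starting at 0x30, 0x40, and 0x80.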
9.2.2
Transaction Termination Behavior
• Retries: For PIO transactions, a count is kept of the number of retries for a given transaction. When this value exceeds the Retry Limit Count, the PBM ceases to attempt the transaction and issues an interrupt to the processor. The Retry Limit Count is fixed at 512.
• Disconnects: The difference between a disconnect and a retry is that there is no data transferred during a retry; otherwise, the signalling is the same. No count is kept of disconnects. The transaction is restarted with the next untransferred data.
• Master-aborts: A master-abort typically happens when no device responds to the PIO address.
• Target-aborts: A target-abort may be received for a variety of error conditions. All cases for which UltraSPARC-IIi may signal a target-abort are given in Chapter 16, “Error Handling.”
9.2.3
Addressing Modes
Only the Linear Incrementing addressing mode is supported. Reserved and Cache Line Wrap address mode accesses are disconnected after the first data phase, allowing the master to complete the transfer one data word at a time.
9.2.4
Configuration Cycles
UltraSPARC-IIi generates both Type 0 and Type 1 configuration accesses. The type generated depends on the bus number field within the configuration address. UltraSPARC-IIi hardwires its Bus Number to 0. See Section 19.3.1, “PCI Configuration Space” on page 300 for details.
Compatibility Note – If Configuration cycles are generated with compressed (E-bit==0) byte or halfword stores, or with random byte enable patterns using the PSTORE instruction, UltraSPARC-IIi does not guarantee that AD[1:0] points to the first byte with a BE asserted. Also, while not addressed by the PCI 2.1 specification, UltraSPARC-IIi can generate multi-databeat configuration reads and writes.
9.2.5
Special Cycles
UltraSPARC-IIi ignores Special Cycles and does not generate them.
9.2.6
PCI INT_ACK Generation
UltraSPARC-IIi can generate an interrupt acknowledge in response to a PCI Interrupt. See Section 19.3.4, “PCI INT_ACK Generation” on page 322 for the method of generating this transaction.
9.2.7
Exclusive Access
UltraSPARC-IIi does not implement locking and the LOCK# signal is not connected. Any exclusive access proceeds as if it were a non-exclusive access.
9.2.8
Fast Back-to-Back Cycles
UltraSPARC-IIi is capable of handling Fast Back-to-Back DMA transactions as a target device. The Fast Back-to-Back Capable bit in the Status register is hardwired to ‘1’. It handles the master-based mechanism (as required) and is capable of decoding the target-based mechanism as well. The address is checked and UltraSPARC-IIi does not reply to masters presenting an invalid address. The specification requires that TRDY#, DEVSEL#, and STOP# be delayed by one cycle unless this device were the target of the previous transaction. This delay causes writes to be extended by a cycle but is hidden on reads. There is little performance gain except for reads that follow writes, but support is provided for third party devices that choose to implement this feature. UltraSPARC-IIi is not capable of generating Fast Back-to-Back PIO transactions and does not implement the Fast Back-to-Back enable bit in the Command Register in the configuration header. A Fast Back-to-Back PIO would remove the idle cycle between two transactions to the same target as long as the first transaction were a write. Alternately stated, it would insert an idle cycle between transactions to different targets and after read transactions. UltraSPARC-IIi does not support this sequence.
9.3
Functional Topics
9.3.1
PCI Arbiter
9.3.1.1
Arbitration Schemes
Two arbitration schemes are implemented in the UltraSPARC-IIi and APB on-chip PCI arbiters. The default condition is fair arbitration, where all enabled requests are serviced in “round-robin” fashion. The second condition (enabled by the ARB_PRIO bits in the PCI Control Register) gives higher priority to a specific request. This allows the device attached to that pair to claim, at most, every other PCI transaction. Additionally, a transaction that is Retried gets the highest priority the next time it asserts its request. Only one request at a time is given this high priority. The high priority remains in effect until the request is accepted without Retry.
9.3.1.2
Bus Parking
The ARB_PARK bit in the PCI Control Register causes the last GNT to remain asserted when no other requests are asserted. This results in a saving of one clock cycle for bursts of transactions from the same device.
9.3.2
PCI Commands
TABLE 9-1 lists the commands that the UltraSPARC-IIi PBM generates.

TABLE 9-1   PCI Command Generation

Command                     C/BE#   Generate?   Notes
Interrupt Acknowledge       0000    Yes
Special Cycle               0001    Yes
I/O Read                    0010    Yes
I/O Write                   0011    Yes
Reserved                    0100    No
Reserved                    0101    No
Memory Read                 0110    Yes         Perform read access, no prefetch
Memory Write                0111    Yes         Perform write access
Reserved                    1000    No
Reserved                    1001    No
Configuration Read          1010    Yes
Configuration Write         1011    Yes
Memory Read Multiple        1100    Yes         Perform read with 8 byte prefetch
Dual Address Cycle          1101    No
Memory Read Line            1110    Yes         Perform read with 64 byte prefetch
Memory Write & Invalidate   1111    No
TABLE 9-2 lists the commands to which UltraSPARC-IIi responds as a Target.
TABLE 9-2   PCI Command Response

Command                     C/BE#   Response
Interrupt Acknowledge       0000    Ignored
Special Cycle               0001    Ignored
I/O Read                    0010    Ignored
I/O Write                   0011    Ignored
Reserved                    0100    Ignored
Reserved                    0101    Ignored
Memory Read                 0110    Perform read access. 64-byte prefetch if to memory; 16-byte prefetch if to UPA64S
Memory Write                0111    Perform write access
Reserved                    1000    Ignored
Reserved                    1001    Ignored
Configuration Read          1010    Ignored
Configuration Write         1011    Ignored
Memory Read Multiple        1100    Perform read with 64 byte prefetch
Dual Address Cycle          1101    Bypass access
Memory Read Line            1110    Perform read with 64 byte prefetch
Memory Write & Invalidate   1111    Equivalent to Memory Write command
Note – All PCI DMA reads to UPA64S address space cause 64-byte read transactions on the UPA64S. This action may cause unwanted prefetch effects. All DMA writes to UPA64S address space cause a succession of 1-16-byte UPA64S writes.
9.4
Little-endian Support
9.4.1
Endian-ness
The UltraSPARC-IIi internal, UPA64S, and DRAM system interfaces are big-endian. That is, the address of a word (or quadword, doubleword, or halfword) is the address of its most significant byte. The PCI bus is little-endian, where the word (or quadword, doubleword …) address is the address of the least significant byte. See the section “Addressing Conventions” in Chapter 6 of The SPARC Architecture Manual, Version 9 for a detailed explanation of this topic.
To route the byte lanes logically correctly, the UltraSPARC-IIi main internal data busses are connected to the PCI bus in a “byte-twisted” fashion. In particular, UltraSPARC-IIi data bits [63:56] are connected to the PCI data bits [7:0], UltraSPARC-IIi bits [55:48] map to PCI bits [15:8], and so on. The PBM internal control registers, which are big-endian, are byte-twisted again internally.
This implementation causes all byte-sized PIOs and byte-stream DMA to be handled correctly. It, along with other features built into SPARC V9 processors, allows all PIO and DMA activity to and from the PCI bus to take place correctly.
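The lane mapping can be modeled in a few lines of Python. This is a sketch under stated assumptions, not a description of the hardware: the function name byte_twist is hypothetical, and the treatment of the lower four byte lanes (carried in the PCI data phase with addr[2] = 1) extends the "and so on" of the text by analogy.

```python
def byte_twist(value64):
    """Map a 64-bit big-endian-side value to two 32-bit PCI data words.

    lanes[0] holds bits [63:56], which the text connects to PCI bits [7:0];
    lanes[1] holds bits [55:48], connected to PCI bits [15:8]; and so on.
    Assumption: lanes 4..7 form the PCI word at addr[2] = 1 the same way.
    """
    lanes = value64.to_bytes(8, "big")
    word_addr2_0 = int.from_bytes(lanes[0:4], "little")
    word_addr2_1 = int.from_bytes(lanes[4:8], "little")
    return word_addr2_0, word_addr2_1
```

Each byte keeps its byte address across the bridge; only its position within the wider word changes.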
9.4.2
Big- and Little-endian regions
9.4.2.1
Address Space
The UltraSPARC-IIi 8 GB address space consists of several regions. The lower 16 MB, from 0x1FE.0000.0000 to 0x1FE.00FF.FFFF, allows access to internal registers within UltraSPARC-IIi IO. This portion of the address space is big-endian and there is no byte twisting done for accesses within this range. There is a large region of unused/reserved address space from 0x1FE.0202.0000 to 0x1FE.FFFF.FFFF. Reads to this address range return zero and writes are simply ignored. The remaining address regions are little-endian. The upper 4 GB, from 0x1FF.0000.0000 to 0x1FF.FFFF.FFFF, is used for accesses to PCI bus memory space. The 16 MB region from 0x0.0100.0000 to 0x0.01FF.FFFF is used for access to PCI configuration space, and there are two 64 kB regions from 0x0.0200.0000 to 0x0.02FF.FFFF that are used to access PCI bus I/O space. All of these address ranges are little-endian, and all accesses to them use byte twisting.
Note – This means that any configuration and status registers in the APB ASIC must be accessed with little-endian loads and stores, or they will appear byte twisted. All configuration and status registers within UltraSPARC-IIi are accessed with big-endian loads and stores, except for those used to access the PCI configuration space.
If the UltraSPARC-IIi PCI bridge ASIC provides the path to the system PROM, the PROM is found between offsets 0x1FF.F000.0000 and 0x1FF.F0FF.FFFF. This range falls in the upper 4 GB region, which UltraSPARC-IIi considers little-endian and subjects to byte-twisting. In spite of the byte-twisting, and because of the way the PROM is programmed, this PROM appears to the system correctly as a big-endian device. An explanation of this mechanism is detailed in succeeding sections.
9.4.2.2
Byte Twisting
FIGURE 9-1 shows how data is manipulated from a 32-bit little-endian PCI bus to 64-bit big-endian UltraSPARC-IIi busses.
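[Figure: labels recoverable from the diagram are the 64-bit data path (bits 63:0, bytes 0 through 7) of the UltraSPARC Core, UPA64S or DRAM memory; the PBM, which selects a 32-bit half with addr[2]=0 or addr[2]=1; and the 32-bit PCI bus (bits 31:0) with bytes 0 through 3 and 4 through 7 in PCI memory.]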
FIGURE 9-1
UltraSPARC-IIi Byte Twisting
9.4.3
Specific Cases
9.4.3.1
PIOs
Normal
All byte sized PIOs work correctly. The byte lane used for a given address on the big-endian side is directly wired to the byte lane used for that address on the little-endian side.
Byte twisting is insufficient for any access larger than a byte. For example, if the 32-bit value 0x12345678 is written to a 32-bit register on a PCI device, the PCI device sees the value 0x78563412 instead. The UltraSPARC core has special support to correct this. By either marking the page containing the PCI register as little-endian in the processor’s MMU, or by using one of the little-endian ASIs, UltraSPARC-IIi will alter its ordering of the bytes so that the PCI device correctly sees 0x12345678.
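The 0x12345678 example can be checked with a short model. The sketch below treats byte twisting as address-preserving (each byte keeps its byte address) and the PCI device as a little-endian reader; the function name pci_view_of_store is hypothetical and the model ignores everything except byte order.

```python
import struct


def pci_view_of_store(value32, little_endian_store):
    """Return the 32-bit value a little-endian PCI device sees after a PIO store."""
    # Bytes as the processor places them at ascending byte addresses.
    fmt = "<I" if little_endian_store else ">I"
    bytes_by_address = struct.pack(fmt, value32)
    # Byte twisting keeps each byte at its address; the device then
    # assembles the word with the lowest address as least significant.
    return struct.unpack("<I", bytes_by_address)[0]
```

A big-endian store of 0x12345678 is seen as 0x78563412, while a store through a little-endian page or ASI is seen as 0x12345678.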
PROM accesses
Instruction fetches from the PROM are a special case because they are unable to use the little-endian features. PROM instruction fetches, like all instruction fetches, are always done in big-endian mode. In UltraSPARC-IIi systems, the PROM could be a byte device on an 8 byte bus, controlled by an integrated IO controller (or SuperIO) IC. This SuperIO could stack the bytes in little-endian format, such that the byte at address 0 in the PROM appears on PCI bus data bits 7:0, byte 1 on bits 15:8, and so on. To function correctly with the byte-twisting of UltraSPARC-IIi, and in the absence of any other byte reordering by the processor, the PROM must be programmed in big-endian order – byte 0 in the PROM should be the MSB of the first instruction. Because of this required byte programming ordering for the PROM, data accesses to the PROM should not use the little-endian byte reordering of the processor, even though the PROM is located within the little-endian PCI space. If only big-endian accesses are made to the PROM, PIOs of any size will return data with the correct byte order. Note that use of a SuperIO IC may require different ordering of the bytes in the PROM to make UltraSPARC-IIi references work correctly.
9.4.3.2
DMA
Data streams
DMA of byte streams works correctly without further intervention. A PCI device that receives the byte stream (01,02,03,04) packs the bytes into a 32-bit register starting with the LSB of the register, that is, 0x04030201. After transferring to memory on the PCI bus, the value 0x01 occurs at the lowest memory location, as required. After byte twisting, the value given to the UltraSPARC core would be 0x01020304. Since the MSB is the lowest memory location, the value 0x01 is still stored at the lowest memory location, as required.
Descriptors
Byte twisting is insufficient for any access larger than a byte, just as for PIOs. With byte twisting used alone, a DMA descriptor access would retrieve the wrong byte ordering. For example, if the value 0x12345678 were set up as an address in a descriptor, the PCI device interprets this value as 0x78563412 instead. To avoid this, the UltraSPARC core little-endian features are used again. Processor loads and stores to the descriptors should be specified as little-endian. This will reorder the bytes in memory so that after byte twisting, the PCI device sees the correct value.
CHAPTER
10
UltraSPARC-IIi IOM
The IO Memory Management Unit (IOM) performs virtual to physical address translation during DVMA cycles. PCI master devices provide a 32-bit virtual address at the beginning of a DVMA transfer, which the IOM translates into 34 bits of physical address. UltraSPARC-IIi contains 16-entry fully-associative Translation Lookaside Buffers (TLBs) and a one-level, software-managed data structure called a Translation Storage Buffer (TSB). The TLB stores recently used translation information. Hardware performs a TSB lookup (also known as hardware table walk) when the translation cannot be found in the TLB. If a TSB lookup fails to locate a valid mapping, the IOM returns an error to the PCI master device. The IOM supports alternative page sizes of 8K and 64K. Mixed page sizes can be used in the system but the TSB table lookup assumes the smaller page size. No page overlapping is allowed. Operation in Bypass mode allows devices with their own translation facility to bypass IOM.
10.1
Block Diagram
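[Figure: labels recoverable from the diagram are the UltraSPARC-IIi PCI IOM & PBM; the TLB CAM (32-bit VA in, HIT out); the TLB RAM (34-bit PA and NC out); the DMA Interface for Table Walks (34-bit PA, DATA, CTRL); the PIO Interface to access TLB & internal Regs (12-bit PA, DATA, CTRL); and the ARB/CTRL block.]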
FIGURE 10-1
IOM Top Level Block Diagram
10.2
TLB Entry Formats
A TLB entry consists of TLB tag in the CAM and TLB data in the RAM.
10.2.1
TLB CAM Tag
Bits:    24:23     22     21    20    19      18:0
Field:   ERRSTS    ERR    W     S     SIZE    VA[31:13]
FIGURE 10-2
TLB CAM Tag Format
FIGURE 10-2 shows the bit fields of the TLB CAM Tag. These assignments are explained in TABLE 10-1.
TABLE 10-1   Description of TLB Tag Fields

Field       Bits    Description                                                                   Type
ERRSTS      24:23   Error Status:                                                                 RW
                    00 = Reserved
                    01 = Invalid Error
                    10 = Reserved
                    11 = UE Error (on TTE read)
ERR         22      When set to 1, indicates that there is an error associated with this entry    RW
W           21      Writable; when set, the page mapped by this TLB has write permission.         RW
S           20      Stream; Ignored by UltraSPARC-IIi                                             RW
SIZE        19      0 means 8K page, 1 means 64K page                                             RW
VA[31:13]   18:0    19-bit VPN (Virtual Page Number)                                              RW
For an IOM miss, if the returned TTE data has Valid = 0, or lacks the appropriate write privilege, or has an uncorrectable ECC error (UE), the IOM adjusts the ERR_STS[1:0] to reflect the error, and sets ERR == 1 and Valid == 1. The error is reported to the DMA master as a Target Abort. The PBM will also log its target-abort generation with the STA bit in the PCI Configuration Space Status Register. The Valid bit for the entry is set, regardless of the state of the valid bit in the TTE data, so the DMA transaction does not cause another IOM miss. Software is responsible for flushing the IOM entry when it rectifies the missing TSB entry or bad DMA address. If a VA hit results in a protection error, the IOM state is not modified.
10.2.2
TLB RAM Data
Bits:    30    29    28    27:21       20:0
Field:   V     U     C     0s or 1s    PA[33:13]
FIGURE 10-3
TLB RAM Data Format
TABLE 10-2   TLB Data Format

Field       Bits    Description                                                   Type
V           30      Valid bit; when set, the TLB data field is meaningful         RW
U           29      Used bit; affects the LRU replacement.                        RW
C           28      Cacheable bit; 1=Cacheable access; 0=Noncacheable.            RW
PA[40:34]   27:21   Not stored; all 1s if Noncacheable; all 0s if cacheable.      R
PA[33:13]   20:0    21-bit physical page number                                   RW
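Because PA[40:34] is not stored but implied by the C bit, the full physical page number can be derived from a TLB data word. The sketch below illustrates that derivation under the field positions of TABLE 10-2; the function name full_physical_page is hypothetical.

```python
def full_physical_page(data):
    """Return PA[40:13] implied by a TLB RAM data word.

    PA[40:34] is not stored: it reads as all 0s for a cacheable entry
    (C = 1) and all 1s for a noncacheable entry (C = 0).
    """
    c_bit = (data >> 28) & 0x1
    ppn = data & 0x1FFFFF            # bits 20:0 hold PA[33:13]
    upper = 0x00 if c_bit else 0x7F  # implied PA[40:34], seven bits
    return (upper << 21) | ppn
```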
10.3
DMA Operational Modes
There are three different operational DMA IOM modes: translation, bypass, and pass-through. The applicable mode depends upon:
• The value of the “MMU_EN” bit of the IOM Control Register
• The PCI addressing mode used: DAC using 64 bits or SAC using 32 bits
• The PCI virtual address: bits 31:29 in SAC mode or bits 63:50 in DAC mode
TABLE 10-3   PCI DMA Modes of Operation

Mode   ad[31:29]   MMU_EN   Addr            Result
SAC    miss        X        N/A             PCI peer-to-peer (Ignored by UltraSPARC-IIi)
SAC    hit         0        N/A             Pass-through
SAC    hit         1        N/A             IOM Translation (DMA)
DAC    X           X        0x0000-0x3FFE   Ignored by UltraSPARC-IIi
DAC    X           X        0x3FFF          Bypass (DMA)
The Target Address Space Register is used to decide if AD[31:29] is a hit.
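The mode selection in TABLE 10-3 can be written as a small decision function. This is a sketch under stated assumptions: the name dma_mode is hypothetical, and the Target Address Space Register is modeled simply as a set of enabled AD[31:29] values (the parameter target_spaces), which is an illustrative simplification and not the register's actual format.

```python
def dma_mode(addr, is_dac, mmu_en, target_spaces):
    """Classify a PCI DMA access following TABLE 10-3.

    target_spaces: hypothetical stand-in for the Target Address Space
    Register, given as the set of AD[31:29] values that count as a hit.
    """
    if is_dac:
        top = (addr >> 50) & 0x3FFF          # address bits 63:50
        return "bypass" if top == 0x3FFF else "ignored"
    hit = ((addr >> 29) & 0x7) in target_spaces   # AD[31:29]
    if not hit:
        return "ignored (PCI peer-to-peer)"
    return "translation" if mmu_en else "pass-through"
```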
10.3.1
Translation Mode
The PBM block initiates the translation by providing a 32-bit virtual address. The IOM hardware performs the following actions in order, beginning with a TLB lookup, until a valid mapping or an error results.
1. If the lookup results in a TLB hit, the IOM returns a 34-bit physical address.
2. If a TLB miss occurs, hardware automatically starts a TSB lookup.
3. If the TSB lookup locates a valid mapping for the virtual page, information in the TSB entry is loaded into the TLB and translation continues.
4. If the TSB lookup results in a miss, an error is returned to the PBM.
The virtual address consists of two fields: virtual page number and page offset. The page offset is carried unchanged from the virtual address to the physical address. The conversion of virtual address to physical address for page sizes 8K and 64K is shown below.
[Figure: PCI address bits 31:13 (Virtual Page Number) are translated to PA bits 33:13 (Physical Page Number); bits 12:0 (Page Offset) are the same in the PCI address and the PA.]
FIGURE 10-4
Virtual to Physical Address Translation for 8K Page Size
[Figure: PCI address bits 31:16 (Virtual Page Number) are translated to PA bits 33:16 (Physical Page Number); bits 15:0 (Page Offset) are the same in the PCI address and the PA.]
FIGURE 10-5
Virtual to Physical Address Translation for 64K Page Size
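The address formation for the two page sizes can be sketched as follows. The function name translate is hypothetical, and the sketch assumes the physical page number is supplied as PA[33:13] (the form held in the TLB data), with PA bits 15:13 taken from the page offset for a 64K page.

```python
def translate(va, ppn, size_64k):
    """Form a physical address from a 32-bit virtual address.

    ppn is PA[33:13]. For a 64K page the low three bits of ppn are not
    used, because PA[15:13] comes from the page offset instead.
    """
    offset_bits = 16 if size_64k else 13
    offset_mask = (1 << offset_bits) - 1
    page_base = (ppn << 13) & ~offset_mask
    return page_base | (va & offset_mask)
```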
10.3.2
Bypass Mode
The IOM allows PCI devices to have their own MMU and bypass the IOM supported by the system. A PCI device is operating in bypass mode if all conditions in the last row in TABLE 10-3 are met. In this mode, the physical address PA[33:0] = PCI_ADDR[33:0].
[Figure: PCI address bits 63:50 hold 0x3FFF; PCI address bits 33:0 (Physical Page Number and Page Offset) are used directly as PA bits 33:0.]
FIGURE 10-6
Physical Address Formation in Bypass Mode (8K and 64K)
A PCI device operating in bypass mode has direct access to the entire physical address space. Bit [34] of PCI_ADDR indicates whether the PCI device is accessing the coherent space, where (PA[34] = 0), or the UPA64S or IO space, where (PA[34] = 1).
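Bypass-mode address formation can be modeled in a few lines. The function name bypass_address and the returned tuple are hypothetical; the bit positions follow the text and FIGURE 10-6.

```python
def bypass_address(pci_addr64):
    """Return (PA[33:0], io_space) for a DAC bypass access, or None.

    None means bits 63:50 are not 0x3FFF, so the access is not a bypass
    access. io_space reflects bit 34: 0 selects the coherent space and
    1 selects the UPA64S or IO space.
    """
    if (pci_addr64 >> 50) & 0x3FFF != 0x3FFF:
        return None
    pa = pci_addr64 & ((1 << 34) - 1)
    io_space = bool((pci_addr64 >> 34) & 0x1)
    return pa, io_space
```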
10.3.3
Pass-through Mode
The IOM operates in pass-through mode if all conditions listed in the first row in TABLE 10-3 are met. Pass-through mode allows access to the coherent address space (DRAM) only. Higher bits of physical address are padded with 0.
[Figure: PCI address bits 31:0 (Physical Page Number and Page Offset) are used directly as PA bits 31:0; PA bits 33:32 are 00.]
FIGURE 10-7
Physical Address Formation in Pass-through Mode (8K and 64K)
10.4
Translation Storage Buffer
The Translation Storage Buffer, or TSB, is a translation table in memory. It contains one-level mapping information for the virtual pages. IOM hardware looks up this table if a translation cannot be found in the TLB. A TSB entry is called a Translation Table Entry, or TTE, and is eight bytes long. The system supports several TSB table sizes and specifies the size with the TSB_SIZE field of the IOM Control Register. The possible table sizes are 1K, 2K, 4K, 8K, 16K, 32K, 64K and 128K entries (not bytes), which support DMA address spaces of 8M to 1G for an 8K page, and 64M to 2G for a 64K page (128K and 64K TSB sizes are not supported with a 64K page). Software must set up the TSB before it allows translation to start.
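The DMA address space covered by a TSB is the number of entries multiplied by the page size. The short sketch below works through that arithmetic for the table sizes listed above; the names dma_space_bytes and supported are hypothetical.

```python
PAGE_8K = 8 * 1024
PAGE_64K = 64 * 1024
TSB_ENTRIES = [n * 1024 for n in (1, 2, 4, 8, 16, 32, 64, 128)]


def dma_space_bytes(entries, page_bytes):
    """DMA address space mapped by a TSB of the given size."""
    return entries * page_bytes


def supported(entries, page_bytes):
    """The 64K- and 128K-entry tables are not supported with a 64K page."""
    return not (page_bytes == PAGE_64K and entries > 32 * 1024)
```

With an 8K page the eight table sizes cover 8 MB through 1 GB; with a 64K page the six supported sizes cover 64 MB through 2 GB.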
10.4.1
Translation Table Entry
Translation Table Entries (TTE) contain translation information for virtual pages. The IOM hardware reads one TTE during a table walk and stores it in the TLB. A TTE entry has valid information only when the DATA_V bit is set. TABLE 10-4 shows the contents of the TTE.
TABLE 10-4   TTE Data Format

Field         Description
DATA_V        Valid bit (1 = TTE entry has valid mapping)
DATA_SIZE     Page size of the mapping (0 = 8K; 1 = 64K)
STREAM        Stream bit (1 = streamable page; 0 = consistent page)
LOCALBUS      Local bus bit; not used
DATA_SOFT_2   Reserved for software use
DATA_PA       Contains bits of physical address; bits 15:13 are not used for 64K page; bits are not used and implied to be 1 if noncacheable, 0 if cacheable.
DATA_SOFT     Reserved for software use
CACHEABLE     Cacheable (1 = cacheable page, 0 = non-cacheable page); not used
DATA_W        Set if this page is writable
TTE data is stored in main memory, in the software-managed TSB. All other bits are reserved.
10.4.2
TSB Lookup
During the TSB lookup, the physical address for the TTE entry is formed based on the following information.
• Base address of the TSB table
• Page size assumed during TSB lookup (as specified by the TBW_SIZE bit in IOM Control Register)
• TSB table size
The TSB Base Address Register contains the physical address of the first TTE entry in the TSB table. The lower order 13 bits of this register are all zeros because the TSB table must be aligned on an 8K boundary regardless of TSB size. The physical address for an entry in the TSB table is formed by adding the base address and an offset generated as shown in TABLE 10-5. The lower order three bits of the offset are set to 0x0 because each TTE entry is eight bytes long.
TABLE 10-5   Offset to TSB Table

TSB Table Size   N    Offset (8K TSB lookup page size)   Offset (64K TSB lookup page size)
                      (TBW_SIZE=0)                       (TBW_SIZE=1)
1K               12   [VA, 000]                          [VA, 000]
2K               13   [VA, 000]                          [VA, 000]
4K               14   [VA, 000]                          [VA, 000]
8K               15   [VA, 000]                          [VA, 000]
16K              16   [VA, 000]                          [VA, 000]
32K              17   [VA, 000]                          [VA, 000]
64K              18   [VA, 000]                          Not allowed (1)
128K             19   [VA, 000]                          Not allowed (1)

1. UltraSPARC-IIi does not detect illegal combinations, and its behavior is unspecified for such combinations. Software must ensure they do not occur.
[Figure: the Base Address (bits 33:13, with bits 12:0 all zero) and the Offset (bits N:3, with bits 2:0 equal to 000) are added to form the TTE Entry Physical Address (bits 33:0).]
FIGURE 10-8
Computation of TTE Entry Address
TBW_SIZE should be set to 0 if the 8K page size or mixed (8K and 64K) page sizes are used for DMA mappings. If mixed page sizes are used, each 64K page will use up 8 TTE entries. Software must fill all 8 entries with the same information.
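The TTE address computation can be sketched as below. Note the assumptions: the function name tte_address is hypothetical, and the choice of virtual address bits used as the table index (VA bits starting at bit 13 for an 8K lookup page size, at bit 16 for a 64K lookup page size, limited to the number of table entries) is an inference from the page-offset widths and the N values in TABLE 10-5, not a statement taken from the table itself.

```python
def tte_address(tsb_base, va, entries, tbw_size):
    """Physical address of the TTE for va (a sketch, see assumptions above).

    tsb_base must be 8K aligned; entries is the TSB size in entries
    (a power of two); tbw_size is 0 for 8K lookup, 1 for 64K lookup.
    """
    if tsb_base & 0x1FFF:
        raise ValueError("TSB base must be aligned on an 8K boundary")
    shift = 16 if tbw_size else 13
    index = (va >> shift) & (entries - 1)   # assumed index selection
    return tsb_base + (index << 3)          # each TTE is eight bytes
```

With TBW_SIZE = 0, the eight 8K-sized pieces of one 64K page index eight consecutive TTEs, which is why software must fill all eight with the same information.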
10.5
PIO Operations
To prevent random PIO operations from interfering with the internal states of the translation, the IOM implements an interlocking mechanism. This mechanism is described below.
• No PIO operation to the IOM is allowed during address translation for any DMA operation.
• No PIO operation to the IOM is allowed during service of a TLB miss.
• For a pending PIO request, the IOM begins the PIO operation once it completes the current translation or TLB miss service.
• In other words, when the IOM is in the idle state, it gives higher priority to PIO requests than to address translations.
10.6
Translation Errors
Translation errors detected by the IOM are:
• Invalid Errors: An invalid error happens if bit DATA_V in the TTE read by IOM hardware indicates that the TTE is invalid (DATA_V = 0).
• Protection Errors: A protection error is detected if the PCI device is doing a DMA write to a page which is mapped as read-only (bit W = 0 in the TLB tag or bit DATA_W = 0 in the TTE).
• TTE UE Error: If a correctable ECC error occurred during table walk, the MCU will correct the error and the TTE received by the IOM is error free. If the ECC error is uncorrectable, the received TTE will be invalid and the IOM will flag an error.
Compatibility Note – There are no time-out errors during table walk for the UltraSPARC-IIi IOM.
Compatibility Note – Bits in the DMA UE AFSR/AFAR are set, and the PA of the TTE entry is saved, on Invalid, Protection (IOM miss), and TTE UE errors. This should aid debugging of software errors. If the Protection error had an IOM hit, the translated PA from the IOM is saved instead of the PA of the TTE entry. This may occur if a prior DMA read caused the IOM entry to be installed.
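The three error cases can be summarized as a decision function. This Python sketch is a model for illustration only: the function name, its parameters, and in particular the order in which the checks are made are assumptions, since the text does not state a precedence between the errors.

```python
def classify_translation(tte_valid, tte_writable, is_dma_write,
                         uncorrectable_ecc=False):
    """Return the error the IOM would flag for one translation, or None.

    The check order (UE, then Invalid, then Protection) is an assumption
    of this sketch, not a documented hardware priority.
    """
    if uncorrectable_ecc:
        return "TTE UE"          # the fetched TTE cannot be trusted
    if not tte_valid:
        return "Invalid"         # DATA_V = 0
    if is_dma_write and not tte_writable:
        return "Protection"      # DMA write to a read-only page
    return None                  # translation proceeds normally
```

Note that a DMA read of a read-only page returns `None` here; only writes can raise a protection error.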
10.7 IOM Demap
Once a mapping between virtual and physical addresses has been established, any change to it must include a demap of the existing mapping before the device can use a new mapping. Demap is required when taking down an existing mapping to make physical memory available to other virtual addresses, or when changing the access permission for a page.
During IOM demap, the PCI device is not allowed to use the page being demapped. If a device attempts to access a page that is currently being demapped, unexpected results may occur.
The following events are needed to demap a page in the IOM:
• The TSB entry is properly updated with new information.
• A TLB flush is performed with the virtual page number.
A TLB flush is initiated by writing the specified virtual page number to the IOM Flush Address Register. Match criteria are different for 8K and 64K page sizes; the hardware performing the flush adjusts the matching criteria based on the page size. The matched entry in the TLB is marked invalid.
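The demap sequence and the page-size-dependent match can be sketched as follows. This is a software model under stated assumptions: the dictionary layout of a TLB entry (`va`, `is_64k`, `valid`), the dictionary used for the TSB, and the function names are hypothetical and do not describe the register formats.

```python
def flush_tlb(tlb, va):
    """Invalidate every valid TLB entry whose page contains `va`."""
    for entry in tlb:
        # A 64K page is matched on fewer VA bits than an 8K page.
        shift = 16 if entry["is_64k"] else 13
        if entry["valid"] and (entry["va"] >> shift) == (va >> shift):
            entry["valid"] = False


def demap(tsb, tlb, va, new_tte):
    """Demap in the order given above: update the TSB entry, then flush the TLB."""
    tsb[va >> 13] = new_tte      # TSB modelled as a dict keyed by 8K page number
    flush_tlb(tlb, va)
```

Updating the TSB first means that a table walk following the flush picks up the new information rather than reinstalling the old mapping.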
10.8 Pseudo-LRU Replacement Algorithm
Compatibility Note – Prior PCI-based UltraSPARC systems implemented a true LRU scheme. The UltraSPARC-IIi IOM uses a 1-bit LRU scheme, just like the UltraSPARC MMUs.
Each TLB entry has an associated “Valid” bit and “Used” bit. On an automatic write to the TLB after a hardware tablewalk, the TLB picks the entry to write based on the following rules:
1. If any entry is not Valid, the first such entry is replaced (measuring from TLB entry 0). If not, then:
2. If any entry is not Used, the first such entry is replaced (measuring from TLB entry 0). If not, then:
3. All but one Used bit are reset, then the process is repeated from Step 2 above.
All replacements can also be forced to a single entry.
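The three replacement rules translate directly into a short routine. The Python sketch below models the Valid and Used bits as lists of booleans; the function name is hypothetical, and the choice of which Used bit survives the reset in rule 3 (here, the last entry's) is an assumption, because the text does not say which one is kept.

```python
def pick_replacement(valid, used):
    """Return the index of the TLB entry to overwrite; may reset bits in `used`."""
    # Rule 1: the first entry that is not Valid.
    for i, is_valid in enumerate(valid):
        if not is_valid:
            return i
    while True:
        # Rule 2: the first entry that is not Used.
        for i, is_used in enumerate(used):
            if not is_used:
                return i
        if len(used) < 2:
            return 0                      # degenerate one-entry TLB
        # Rule 3: reset all but one Used bit, then retry rule 2.
        # Keeping the last entry's bit is an assumption of this sketch.
        for i in range(len(used) - 1):
            used[i] = False
```

When every entry is both Valid and Used, the reset leaves entry 0 as the first not-Used entry, so it becomes the victim on the second pass.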
10.9 TLB Initialization and Diagnostics
The IOM provides direct access to its internal resources, such as TLB Tag, TLB Data, and Match Comparison Logic. After power is turned on, the contents of the IOM are undefined. Before any DMA is allowed to use the IOM, all TLB entries must be invalidated by writing 0s to them.
CHAPTER 11
Interrupt Handling
11.1 Overview
The “Mondo” interrupt transfer mechanism for Sun4u systems reduces interrupt service overhead by directly identifying the unique interrupter, without polling multiple status registers.
SPARC V9 CPUs provide a dedicated set of registers to be used exclusively for servicing interrupts. This eliminates the need for the processor to save its current register set to service an interrupt, and then restore it later.
An interrupt packet contains a Mondo vector, which has three double words designed to assist the processor in servicing the interrupt.
Limitations of the Mondo vector approach include:
• Only one interrupt request packet can be serviced at a time.
• There is no priority level associated with Mondo vector interrupts; they are serviced on a first come, first served basis.
Interrupt packet delivery now happens inside UltraSPARC-IIi, rather than being visible on the UPA interconnect. Because it is an internal, dedicated uniprocessor path, the flow control issues are simpler, and no interrupt retry is needed. UltraSPARC-IIi delivers only one interrupt packet at a time, each after software has acknowledged the previous one (by clearing the MVR_BUSY bit in the mondo receive trap handler).
11.1.1 Mondo Dispatch Overview
UltraSPARC-IIi’s PIE logic block is responsible for fielding interrupts from external PCI sources, other external sources, and internal UltraSPARC-IIi sources; loading the mondo data receive registers; and signalling a mondo receive trap to the UltraSPARC-IIi pipeline.
External interrupt sources include 8 PCI slots on two separate PCI busses, the onboard IO devices, a graphics interrupt, and the expansion UPA slot. These interrupts are concentrated in an external ASIC and presented to the Mondo Unit one at a time. This saves pins on UltraSPARC-IIi. Internal interrupt sources include ECC (errors) and PBM (PCI bus errors).
Each of the 8 PCI slots has 4 interrupts. However, with the current RIC chip, only 26 PCI interrupt requests can be connected. The documentation assumes these interrupts are mapped to certain slots and INTA-D wires. System designers are free to distribute the PCI interrupt wires differently, but system software will then need a new mapping of PCI slots and related CSRs. The CSRs and logic are implemented so that 32 PCI interrupts can be handled, if required, using a new RIC IC.
11.2 Mondo Unit Functional Description
11.2.1 Mondo Vectors
The Sun4u architectural specification states that interrupts are delivered to the processor potentially using three double words that carry “pertinent” information. Note that UltraSPARC-IIi does not deliver interrupt data, only the Interrupt Number. Reads of Mondo Data Receive registers 1 and 2 always return 0.
[Figure: three 64-bit words. The first word holds Int Num in bits 10:0; the second and third words hold Data 1 and Data 2 in bits 63:0.]
FIGURE 11-1 Mondo Vector Format
The first data register contains the interrupt number (11 bits). The interrupt number is specific to each interrupt source.
The CPU can process only one interrupt at a time. The Mondo Dispatch Unit is responsible for remembering all interrupts that have arrived, and serializing them to the CPU pipeline as traps. In addition, it tracks the state of pending DMA writes in the APB and UltraSPARC-IIi, and guarantees that all DMA writes that completed on the secondary PCI buses (temporally) before a PCI interrupt request also complete to memory before the CPU is notified.
11.2.1.1 DMA Synchronization
After receiving any external interrupt request, the PIE checks whether the two SB_EMPTY lines are asserted, indicating no pending DMA writes inside the external APB ASICs. If SB_EMPTY is asserted, the PIE then checks that there are no pending DMA writes to the MCU. If either empty indication is false, the PIE asserts SB_DRAIN, blocking arrival of future DMA writes (some may arrive during the transmission time). The PIE then waits for both SB_EMPTY assertions, and then further waits for the internal EMPTY assertion. At this point the trap may be delivered, and all other pending interrupts are marked as “synchronized”, so that this process need not be repeated when they arrive at the CPU. The PIE deasserts SB_DRAIN once it sees that DMA writes are successfully cleared from both the APB and the MCU/PBM.
SB_DRAIN does not have to block any other external PCI activity, as long as the SB_EMPTY and MCU/PBM DMA activity signals reflect only the status of pending DMA writes. There is no deadlock, since the MCU can only forward DMA writes to slave devices, that is, memory and UPA64S.
A read-only CSR is available that causes this DRAIN-EMPTY protocol to be activated by a noncacheable load. The load does not complete until the DRAIN-EMPTY synchronization protocol completes. This allows software to synchronize against outstanding DMA writes when there is a standard PCI bus bridge beyond the APB: first issue a PIO read to the far bus bridge; then, after it completes, synchronize against the APB and UltraSPARC-IIi using the CSR read.
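The DRAIN-EMPTY handshake can be outlined as a sequence of steps. The Python sketch below is a highly simplified model: the function name, the use of pending-write counts as inputs, and the step strings are invented for illustration, and the relative order of trap delivery and SB_DRAIN deassertion is an assumption, since the text does not fix it.

```python
def synchronize_dma(apb_pending, internal_pending):
    """Return the ordered steps the PIE model takes before delivering a trap.

    apb_pending: pending DMA-write counts for the two APB ASICs (model input).
    internal_pending: pending DMA-write count inside the MCU/PBM (model input).
    """
    steps = []
    sb_empty = all(count == 0 for count in apb_pending)
    internal_empty = internal_pending == 0
    if not (sb_empty and internal_empty):
        steps.append("assert SB_DRAIN")          # block further DMA writes
        steps.append("wait for both SB_EMPTY")   # APB buffers drain first
        steps.append("wait for internal EMPTY")  # then the MCU/PBM queue
        steps.append("deassert SB_DRAIN")
    steps.append("deliver trap")
    return steps
```

When nothing is pending, the model delivers the trap immediately, which mirrors the case where both empty indications are already true.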
11.2.1.2 Interrupt Number Register
Generally, each interrupt source has an Interrupt Number Register (INR) associated with it. The INR is either fully or partially software programmable and contains the Interrupt Number and a valid bit which enables or disables the interrupt.
Bit 31       V
Bits 30:26   Target Processor
Bits 25:11   Reserved
Bits 10:0    Interrupt Number
FIGURE 11-2 Full INR Contents
As shown, the INR has 3 fields:
1. Valid bit (1 bit) - enables the interrupt when set to 1. Note that when an interrupt is present and the valid bit is 0, the interrupt is prevented from being delivered. However, once the valid bit is set to 1, the interrupt is delivered.
2. Target Processor (5 bits) - Read-only as 0 for UltraSPARC-IIi.
3. Interrupt Number (11 bits)
For most interrupts, the Interrupt Number field is further broken down into two separate fields: the Interrupt Group Number (IGN) and the Interrupt Number Offset (INO). The INO is a fixed value depending on the interrupt.
Compatibility Note – The IGN on UltraSPARC-IIi is not programmable, and is fixed to 0x1F.
Bit 31       V
Bits 30:26   Target Processor
Bits 25:11   Reserved
Bits 10:6    Int. Group Number
Bits 5:0     Int. Num. Offset
FIGURE 11-3 Partial INR Contents
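The field boundaries in the two INR layouts can be checked with a small decoder. The Python sketch below is illustrative only; the function name and the dictionary keys are hypothetical, while the bit positions follow the layouts described above.

```python
def decode_inr(reg):
    """Split a 32-bit INR value into the fields of the full and partial layouts."""
    number = reg & 0x7FF                  # bits 10:0, Interrupt Number
    return {
        "valid": (reg >> 31) & 0x1,       # bit 31
        "target": (reg >> 26) & 0x1F,     # bits 30:26, reads as 0 on UltraSPARC-IIi
        "number": number,
        "ign": number >> 6,               # bits 10:6, Interrupt Group Number
        "ino": number & 0x3F,             # bits 5:0, Interrupt Number Offset
    }
```

For example, a valid INR with the fixed IGN of 0x1F and an offset of 0x20 carries the 11-bit interrupt number 0x7E0.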
External Interrupts
External Interrupts refer to those interrupts that are generated external to UltraSPARC-IIi. All external sources for interrupts (PCI, OBIO, Graphics, and UPA64S) go through the Interrupt Concentrator, a RIC ASIC.
[Figure: the RIC ASIC collects PCI_A0_INT_, PCI_A1_INT_, PCI_B0_INT_, PCI_B1_INT_, PCI_B2_INT_, and PCI_B3_INT_ (4 lines each), 12 OBIO lines, and the Graphics and UPA64S interrupts, and drives the 6-bit INT_NUM to UltraSPARC-IIi.]
FIGURE 11-4 Interrupt Concentrator
The Interrupt Concentrator simply samples all interrupt lines in round-robin fashion, and presents one of them at a time to UltraSPARC-IIi. To save package pins, the 38 interrupt lines are encoded into a 6-bit value that passes to UltraSPARC-IIi.
• PCI - UltraSPARC-IIi supports 8 total PCI slots on two separate busses. Each PCI slot has 4 interrupt lines. RIC only supports 26 of these.
• On-board IO Devices (OBIO) - There are 12 interrupts from OBIO devices.
• Graphics/UPA - 2 UPA slot interrupts are supported. These are the only two interrupts that are of pulse type (see below). They are also the only interrupts with the complete, fully software-programmable INR register. All other interrupts have IGN and INO fields.
11.2.1.3 Priority
Each interrupt has a priority associated with it. There are eight priority levels: priority 8 is the highest and priority 1 is the lowest. Priority is taken into account during interrupt arbitration. When multiple interrupts are present, the highest priority interrupt is delivered first. If multiple interrupts with the same priority are present, they are delivered in a round-robin fashion. When all interrupts at the highest priority level have been delivered, the next highest priority level is processed.
TABLE 11-1 Interrupt Receiver State Register
Level  Number of Interrupts  Source
8      6                     Audio Record, Power Fail, Floppy, UE ECC, CE ECC, PBM error
7      6                     Kbd/mouse/serial, Serial Int, Audio Playback, PCI_A0_INTA#, PCI_A1_INTA#
6      6                     PCI_B0_INTA#, PCI_B1_INTA#, PCI_B2_INTA#, PCI_B3_INTA#, PCI_A2_INTA#, PCI_A3_INTA#
5      7                     OB Graphics, UPA64S Int, PCI_A0_INTB#, PCI_A1_INTB#, PCI_A0_INTC#, PCI_A1_INTC#, PCI_A2_INTB#
4      7                     Keyboard Int, Mouse Int, PCI_B0_INTB#, PCI_B1_INTB#, PCI_B2_INTB#, PCI_B3_INTB#, PCI_A3_INTB#
3      6                     SCSI Int, Ethernet Int, PCI_B0_INTC#, PCI_B1_INTC#, PCI_B2_INTC#, PCI_B3_INTC#
2      6                     Parallel Port, Spare Int, PCI_A0_INTD#, PCI_A1_INTD#, PCI_A2_INTC#, PCI_A3_INTC#
1      6                     PCI_B0_INTD#, PCI_B1_INTD#, PCI_B2_INTD#, PCI_B3_INTD#, PCI_A2_INTD#, PCI_A3_INTD#
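The arbitration rule (fixed priority between levels, round-robin within a level) can be modelled in a few lines. The Python class below is a software sketch only: the class and method names are hypothetical, and the round-robin order within a level is taken to be the order in which sources are listed, which is an assumption.

```python
class PriorityArbiter:
    """Fixed priority between levels, round-robin within a level (software model)."""

    def __init__(self, sources):
        # sources: iterable of (name, level) pairs; level 8 is highest, 1 lowest
        self.by_level = {}
        for name, level in sources:
            self.by_level.setdefault(level, []).append(name)
        self.next_index = {level: 0 for level in self.by_level}
        self.pending = set()

    def post(self, name):
        """Record that the named interrupt is present."""
        self.pending.add(name)

    def deliver(self):
        """Return the next interrupt to deliver, or None if nothing is pending."""
        for level in sorted(self.by_level, reverse=True):
            names = self.by_level[level]
            start = self.next_index[level]
            for step in range(len(names)):
                i = (start + step) % len(names)
                if names[i] in self.pending:
                    self.pending.discard(names[i])
                    # Resume after this source next time (round-robin).
                    self.next_index[level] = (i + 1) % len(names)
                    return names[i]
        return None
```

With SCSI and Ethernet both at level 3, a SCSI interrupt that has just been delivered yields to a pending Ethernet interrupt on the next arbitration, even if SCSI is asserted again.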
11.3 Details
Three registers are loaded with data on each interrupt.
For UltraSPARC-IIi, the upper 53 bits of the first interrupt word, as well as the last two 64-bit words, are 0. The least significant 11 bits of the first word contain an interrupt number (INR), which indicates the type of interrupting event. Software uses the INR to index into a table that typically supplies the IRL, the PC of the interrupt service routine, and the arguments for the routine.
Two types of interrupt lines enter the concentrator: pulse and level. The distinction between these is not visible to software but is explained for clarity. Processing hardware treats these types of interrupts slightly differently.
In the case of a level interrupt, the concentrator takes the set of asserted level interrupt lines, scans them, and sends the code corresponding to each such interrupt once per scan time. Hardware within the UltraSPARC-IIi detects the first assertion of a code, and causes a state transition which queues an interrupt packet for the UltraSPARC-IIi core. A three-state FSM transmits only one interrupt (provided it remains in the PENDING state) regardless of how many interrupt codes it receives from a source. A software write causes a transition to the IDLE state and “rearms” the FSM to accept another interrupt.
Pulse interrupts are scanned and delivered to UltraSPARC-IIi in a similar fashion; however, only one code is given per pulse. The distinction is subtle, but very important. In the case of the existing interrupts, multiple interrupt sources can contribute to the physical line signalling the interrupt, but there is no restriction which guarantees that software knows that the interrupt line has properly deasserted. In the case of pulse interrupts, this is required. There must be the equivalent of the pending register in the device sourcing the interrupt. Writing to this register guarantees that the interrupt line has been deasserted and therefore pulsed.
As a consequence, the state machine in the UltraSPARC-IIi that corresponds to a pulse interrupt has only two states. Refer to “Interrupt States” on page 117 for a discussion of the state transitions.
11.4 Interrupt Initialization
All fields in all mapping registers listed above reset to 0. When the valid bit is cleared, no interrupts are generated from that interrupt group. Prior to receiving the first interrupt, software must program all mapping registers to set INR. Hardware guarantees that any transaction not in progress when the valid bit is disabled does not proceed. Once the valid bit is enabled again, interrupts proceed.
Note that the valid bit only gates delivery of interrupts to the processor. It does not affect other state transitions within the interrupt logic. An interrupt can be delivered immediately upon first setting the valid bit if an interrupt condition exists.
11.5 Interrupt Servicing
Upon receipt of an interrupt, and assuming that PSTATE.IE=1, the UltraSPARC-IIi core takes a type 0x60 trap. The INR is used to index into a table that provides three pieces of information: the IRL, the PC for the interrupt service routine, and the arguments that need to be supplied. A SOFTINT trap is issued to call the interrupt service enqueue routine with this information.
When the interrupt service routine has performed all device-level servicing, it calls an operating system service to dequeue it. This OS service must write the clear interrupt register for the appropriate interrupt source in order to re-enable interrupts from that source. Information in the appropriate clear interrupt register should be saved at the time of enqueue.
Note – The UltraSPARC-IIi core uses PSTATE.IE to enable the generation of traps for IRL[4:0]. Software should not disable PSTATE.IE for a long period of time when servicing IRL[4:0].
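The table lookup performed by the trap handler can be sketched in software. In the Python model below, the function name, the shape of `dispatch_table`, the per-level queues, and the convention that SOFTINT bit n requests level n are all assumptions made for illustration; a real handler would be written in privileged SPARC code.

```python
from collections import deque

INR_MASK = 0x7FF   # the interrupt number occupies bits 10:0 of the first word


def enqueue_interrupt(data0, dispatch_table, queues, softint):
    """Look up the INR, queue the service request, and return the new SOFTINT value.

    dispatch_table maps INR -> (irl, handler, args); all three are placeholders
    chosen by the operating system, not values defined by the hardware.
    """
    inr = data0 & INR_MASK
    irl, handler, args = dispatch_table[inr]
    queues.setdefault(irl, deque()).append((handler, args, inr))
    return softint | (1 << irl)   # request a software interrupt at level irl
```

Keeping the INR with the queued request lets the later dequeue step identify which clear interrupt register to write.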
11.6 Interrupt Sources
Interrupts in UltraSPARC-IIi systems come from I/O devices, system error conditions, and software. Examples of sources of I/O device interrupts are PCI slots and the graphics interface.
All I/O device interrupts are connected to the Interrupt Concentrator (the RIC IC). The Interrupt Concentrator scans through its inputs and encodes the interrupt into 6 bits for UltraSPARC-IIi. UltraSPARC-IIi maintains state information on all of the interrupt sources and sends an interrupt packet to the proper processor.
A unique interrupt number can be assigned to each interrupt signal line connected to the Interrupt Concentrator. The interrupt number allows software to identify the interrupt source without polling devices. Except for the serial ports and the keyboard and mouse, system devices do not share interrupts. There are no outgoing interrupts from the processor.
11.6.1 PCI Interrupts
The 24 interrupts (6 slots) of prior PCI-based UltraSPARC systems are supported. Eight interrupts for two more slots are also supported, although RIC does not support all the INT_NUM[4:0] encodings that are specified.
11.6.2 On-board Device Interrupts
Additional interrupts are available for use by non-PCI devices or integrated I/O devices with more interrupt requests.
11.6.3 Graphic Interrupt
During the vertical blanking period, the UPA64S device can generate an interrupt that is fed to the interrupt concentrator. Masking and clearing the UPA64S interrupt is done through the UPA64S ASIC register.
11.6.4 Error Interrupts
Internal errors detected by the PCI logic in UltraSPARC-IIi are generally reported through interrupts. Error-related information is recorded in UltraSPARC-IIi internal registers. Refer to Chapter 16, “Error Handling” for details.
Since the Advanced PCI Bridge (APB) can delay the completion of writes, it may report late that it cannot complete a write on the secondary PCI busses. APB logs status associated with this error, and signals an error (SERR) to UltraSPARC-IIi, which causes an interrupt.
11.6.5 Software Interrupts
The processor can send an interrupt to itself by setting bits in the UltraSPARC-IIi SOFTINT Register.
11.7 Interrupt Concentrator
The Interrupt Concentrator logic is implemented in a Reset/Interrupt/Clock Controller (RIC) chip, part number STP2210QFP. It encodes interrupts from various sources into a 6-bit code that UltraSPARC-IIi IO uses to identify the interrupt source. The code assignment is transparent to software. See TABLE 11-4.
Note – A value of all ones in INT_NUM indicates the idle condition.
The Interrupt Concentrator scans the interrupt inputs in fixed order. If there is no active interrupt, the IDLE code is sent to UltraSPARC-IIi. When it detects an active interrupt, the Interrupt Concentrator changes the code from IDLE to one of the active codes. It can deliver one interrupt code to UltraSPARC-IIi every PCI clock cycle, with an initial latency of three clock cycles. If multiple interrupts are active at the same time, the interrupts behind the current one observe the latency due to the Interrupt Concentrator. The worst-case latency introduced by the Interrupt Concentrator is 50 PCI clock periods. This figure describes only the latency from the assertion of an interrupt line to the receipt of the interrupt code in the UltraSPARC-IIi.
The Interrupt Concentrator does not keep track of any state for level interrupts. For pulse interrupts, it tracks the assertion of the interrupt, and transmits only one code for each assertion. Filter logic within the chip inhibits sending additional codes to UltraSPARC-IIi until the interrupt signal is deasserted. TABLE 11-2 lists the edge-sensitive interrupts.
TABLE 11-2 INT Code Assignments for Edge-sensitive Interrupts
Interrupt Source                INT Code
Graphics Interrupt              0x23
Spare edge sensitive interrupt  0x26
Level interrupt codes are sent to the UltraSPARC-IIi whenever there is a currently active interrupt. The UltraSPARC-IIi must ignore further incoming codes for an interrupt that it has already detected.
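The difference between level and pulse handling in the concentrator can be shown with a scan model. The Python sketch below is illustrative: the class name, the representation of asserted lines as a set of codes, and scanning in ascending code order are assumptions. The IDLE code 0x3F (all ones in six bits) and the two edge-sensitive codes come from the text above.

```python
IDLE_CODE = 0x3F              # all ones in the 6-bit INT_NUM
PULSE_CODES = {0x23, 0x26}    # edge-sensitive sources listed in TABLE 11-2


class ConcentratorModel:
    def __init__(self):
        # Pulse codes already reported for the current assertion.
        self.pulse_sent = set()

    def scan(self, asserted):
        """Return the codes emitted during one full scan.

        `asserted` is the set of codes whose input lines are active.
        """
        # A pulse source may report again once its line has deasserted.
        self.pulse_sent &= asserted
        emitted = []
        for code in sorted(asserted):       # assumed fixed scan order
            if code in PULSE_CODES:
                if code in self.pulse_sent:
                    continue                # filter repeats of the same pulse
                self.pulse_sent.add(code)
            emitted.append(code)            # level codes repeat every scan
        return emitted or [IDLE_CODE]
```

A level source therefore produces its code on every scan while asserted, which is why the receiving state machine must ignore repeats.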
11.8 UltraSPARC-IIi Interrupt Handling
11.8.1 Interrupt States
Interrupts generated by I/O devices are of level or pulse type and are converted into UPA interrupt packets. UltraSPARC-IIi must keep track of the state of each level interrupt to avoid reacting to an interrupt that the processor has already received. The three FSM states are IDLE, XMIT, and PEND. Pulse interrupts use only IDLE and XMIT.
TABLE 11-3 Interrupt State Transition Table
State Transition  Description
IDLE -> XMIT      An active interrupt is detected from Interrupt Concentrator.
XMIT -> PEND      The interrupt has been delivered to the processor. This transition is present only for the three state version.
XMIT -> IDLE      The interrupt has been delivered to the processor. This transition is present only for the two state version.
PEND -> IDLE      The interrupt has been cleared by software.
Note – The PEND state indicates that the interrupt was already sent to the UltraSPARC-IIi core and is not yet cleared. For the state machine to transition to this state, the valid bit in the mapping register must be set. Interrupts for which the valid bit is not set can transition to the XMIT state, but may not dispatch to the UltraSPARC-IIi core.
The interrupt state information can be obtained from the Interrupt State Registers in UltraSPARC-IIi. Two bits in each register define the state of an interrupt. Please refer to Section 19.3.3, “Interrupt Registers” on page 313 for a description of the registers.
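The state transitions can be expressed as a small state machine. The Python class below is a model for illustration: the class and method names are hypothetical, and each method corresponds to one event named in the transition table (code detected, delivery to the processor, software clear).

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    XMIT = 1
    PEND = 2


class InterruptFSM:
    def __init__(self, pulse=False, valid=True):
        self.pulse = pulse      # pulse sources use only IDLE and XMIT
        self.valid = valid      # valid bit from the mapping register
        self.state = State.IDLE

    def code_received(self):
        """Interrupt Concentrator reported this source: IDLE -> XMIT."""
        if self.state is State.IDLE:
            self.state = State.XMIT

    def dispatch(self):
        """Try to deliver to the core; returns True if a packet was sent."""
        if self.state is not State.XMIT or not self.valid:
            return False
        self.state = State.IDLE if self.pulse else State.PEND
        return True

    def clear(self):
        """Software clear: PEND -> IDLE."""
        if self.state is State.PEND:
            self.state = State.IDLE
```

Repeated codes from a level source are absorbed while the machine sits in PEND, so only one packet is sent until software clears the interrupt.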
11.8.2 Interrupt Prioritizing
If there are multiple interrupts in the XMIT state, their dispatch is based on a fixed priority. Between interrupts of the same priority, round-robin priority arbitration is applied.
11.8.3 Interrupt Dispatching
UltraSPARC-IIi maintains an interrupt number lookup table as shown in TABLE 11-4. The Interrupt Vector Data Registers in UltraSPARC-IIi are used to store the INR created from this lookup.
After an Interrupt Vector Data Register is loaded with data, the UltraSPARC-IIi core must not receive another interrupt until it empties the register. Loading interrupt data into an Interrupt Vector Data Register sets the Interrupt Vector Receive Register “Busy” bit. This bit indicates to the UltraSPARC-IIi IO that it must neither send another interrupt to the UltraSPARC-IIi core, nor load an Interrupt Vector Data Register, until this bit is cleared. The “Busy” bit can also be cleared by software.
After the UltraSPARC-IIi core receives the interrupt, an interrupt trap is generated if the IE bit of the PSTATE Register is set to 1. The trap type for the interrupt trap is 0x60.
TABLE 11-4 Summary of Interrupts
RIC pin           Interrupt            Int/Ext  Source  INT_NUM (from RIC)  Type   Offset    Priority
SB0_INTREQ7       PCI A Slot 0, INTA#  Ext      PCI     0x07                Level  0x00      7
SB0_INTREQ5       PCI A Slot 0, INTB#  Ext      PCI     0x05                Level  0x01      5
SB2_INTREQ5       PCI A Slot 0, INTC#  Ext      PCI     0x15                Level  0x02      5
SB0_INTREQ2       PCI A Slot 0, INTD#  Ext      PCI     0x02                Level  0x03      2
SB1_INTREQ7       PCI A Slot 1, INTA#  Ext      PCI     0x0F                Level  0x04      7
SB1_INTREQ5       PCI A Slot 1, INTB#  Ext      PCI     0x0D                Level  0x05      5
SB3_INTREQ5       PCI A Slot 1, INTC#  Ext      PCI     0x1D                Level  0x06      5
SB1_INTREQ2       PCI A Slot 1, INTD#  Ext      PCI     0x0A                Level  0x07      2
SB2_INTREQ7       PCI A Slot 2, INTA#  Ext      PCI     0x17                Level  0x08      6
(no RIC support)  PCI A Slot 2, INTB#  Ext      PCI     0x38                Level  0x09      5
(no RIC support)  PCI A Slot 2, INTC#  Ext      PCI     0x10                Level  0x0A      2
SB2_INTREQ2       PCI A Slot 2, INTD#  Ext      PCI     0x12                Level  0x0B      1
(no RIC support)  PCI A Slot 3, INTA#  Ext      PCI     0x18                Level  0x0C      6
(no RIC support)  PCI A Slot 3, INTB#  Ext      PCI     0x39                Level  0x0D      4
(no RIC support)  PCI A Slot 3, INTC#  Ext      PCI     0x00                Level  0x0E      2
SB3_INTREQ2       PCI A Slot 3, INTD#  Ext      PCI     0x1A                Level  0x0F      1
SB0_INTREQ6       PCI B Slot 0, INTA#  Ext      PCI     0x06                Level  0x10      6
SB0_INTREQ4       PCI B Slot 0, INTB#  Ext      PCI     0x04                Level  0x11      4
SB0_INTREQ3       PCI B Slot 0, INTC#  Ext      PCI     0x03                Level  0x12      3
SB0_INTREQ1       PCI B Slot 0, INTD#  Ext      PCI     0x01                Level  0x13      1
SB1_INTREQ6       PCI B Slot 1, INTA#  Ext      PCI     0x0E                Level  0x14      6
SB1_INTREQ4       PCI B Slot 1, INTB#  Ext      PCI     0x0C                Level  0x15      4
SB1_INTREQ3       PCI B Slot 1, INTC#  Ext      PCI     0x0B                Level  0x16      3
SB1_INTREQ1       PCI B Slot 1, INTD#  Ext      PCI     0x09                Level  0x17      1
SB2_INTREQ6       PCI B Slot 2, INTA#  Ext      PCI     0x16                Level  0x18      6
SB2_INTREQ4       PCI B Slot 2, INTB#  Ext      PCI     0x14                Level  0x19      4
SB2_INTREQ3       PCI B Slot 2, INTC#  Ext      PCI     0x13                Level  0x1A      3
SB2_INTREQ1       PCI B Slot 2, INTD#  Ext      PCI     0x11                Level  0x1B      1
SB3_INTREQ6       PCI B Slot 3, INTA#  Ext      PCI     0x1E                Level  0x1C      6
SB3_INTREQ4       PCI B Slot 3, INTB#  Ext      PCI     0x1C                Level  0x1D      4
SB3_INTREQ3       PCI B Slot 3, INTC#  Ext      PCI     0x1B                Level  0x1E      3
SB3_INTREQ1       PCI B Slot 3, INTD#  Ext      PCI     0x19                Level  0x1F      1
SCSI_INT          SCSI                 Ext      OBIO    0x20                Level  0x20      3
ETHERNET_INT      Ethernet             Ext      OBIO    0x21                Level  0x21      3
PARALLEL_INT      Parallel Port        Ext      OBIO    0x22                Level  0x22      2
TABLE 11-4 Summary of Interrupts (Continued)
RIC pin           Interrupt            Int/Ext  Source  INT_NUM (from RIC)  Type   Offset    Priority
AUDIO_INT         Audio Record         Ext      OBIO    0x24                Level  0x23      8
SB3_INTREQ7       Audio Playback       Ext      OBIO    0x1F                Level  0x24      7
POWER_FAIL_INT    Power Fail           Ext      OBIO    0x25                Level  0x25      8
KEYBOARD_INT      Kbd/Mouse/Serial     Ext      OBIO    0x28                Level  0x26      7
FLOPPY_INT        Floppy               Ext      OBIO    0x29                Level  0x27      8
SPARE_INT         Spare Hardware       Ext      OBIO    0x2A                Level  0x28      2
SKEY_INT          Keyboard             Ext      OBIO    0x2B                Level  0x29      4
SMOU_INT          Mouse                Ext      OBIO    0x2C                Level  0x2A      4
SSER_INT          Serial               Ext      OBIO    0x2D                Level  0x2B      7
-                 reserved             -        -       -                   -      0x2C-2D   -
-                 Uncorrectable ECC    Int      ECC     -                   Level  0x2E      8
-                 Correctable ECC      Int      ECC     -                   Level  0x2F      8
-                 PCI Bus Error        Int      PBM     -                   Level  0x30      8
-                 reserved             -        -       -                   -      0x31-32   -
GRAPHIC1_INT      Graphics             Ext      UPA64S  0x23                Pulse  From INR  5
GRAPHIC2_INT      Graphics             Ext      UPA64S  0x26                Pulse  From INR  5
-                 No interrupt         Ext      None    0x3F                N/A    N/A       N/A
11.9 Interrupt Global Registers
To expedite interrupt processing, a separate set of global registers is implemented in UltraSPARC-IIi. As described in Section 11.10.5, “Interrupt Vector Receive” on page 123, the processor takes an implementation-dependent interrupt_vector trap after receiving an interrupt packet. Software uses a number of scratch registers while determining the appropriate handler and constructing the interrupt state. UltraSPARC-IIi provides a separate set of eight Interrupt Global Registers (IG) that replace the eight programmer-visible global registers during interrupt processing. When an interrupt_vector trap is taken, the hardware selects the interrupt global registers by setting the PSTATE.IG field. The PSTATE extension is described in Section 14.5.9, “PSTATE Extensions: Trap Globals” on page 200. The previous value of PSTATE is restored from the trap stack by a DONE or RETRY instruction on exit from the interrupt handler.
11.10 Interrupt ASI Registers
Note – MEMBAR #Sync is generally needed after stores to interrupt ASI registers.
Caution – Using ASI 0x76/77/7E/7F with VA[40:39]==00 and a VA[15:0] matching any of the PA[15:0] listed for the CSR addresses in noncacheable space, other than 0x00, 0x18, 0x20, 0x38, 0x40, 0x50, 0x60, or 0x70, can cause a load to return data from, and a store to modify, the corresponding CSR. The list of addresses is in the “DMA Error Registers” on page 330.
11.10.1 Outgoing Interrupt Vector Data
Name: Outgoing Interrupt Vector Data Registers (Privileged)
ASI_SDB_INTR_W (data 0): ASI==0x77, VA==0x40
ASI_SDB_INTR_W (data 1): ASI==0x77, VA==0x50
ASI_SDB_INTR_W (data 2): ASI==0x77, VA==0x60
TABLE 11-5 Outgoing Interrupt Vector Data Register Format
Bits  Field  Use   RW
-     Data   Data  W
Data: Interrupt data
Compatibility Note – UltraSPARC-IIi does not send interrupts to any devices. A write to these registers has no effect.
Non-privileged access to this register causes a privileged_action trap.
11.10.2 Interrupt Vector Dispatch
Name: ASI_SDB_INTR_W (interrupt dispatch) (Privileged, write-only)
ASI: 0x77, VA==0, VA== target MID, VA==0x70
UltraSPARC-IIi does not send interrupts to any devices. A write to this register has no effect. A read from this ASI causes a data_access_exception trap. Non-privileged access to this register causes a privileged_action trap.
11.10.3 Interrupt Vector Dispatch Status Register
Name: ASI_INTR_DISPATCH_STATUS (Privileged, read-only)
ASI: 0x48, VA==0
TABLE 11-6 Interrupt Dispatch Status Register Format
Bits  Field     Use        RW
-     Reserved  —          R
-     NACK      Always 0.  R
-     BUSY      Always 0.  R
NACK: Cleared at the start of every interrupt dispatch attempt; set when a dispatch has failed.
BUSY: Set if there is an outstanding dispatch.
Compatibility Note – UltraSPARC-IIi does not send interrupts to any devices. A read of this register always returns zeros.
Writes to this ASI cause a data_access_exception trap. Non-privileged access to this register causes a privileged_action trap.
11.10.4 Incoming Interrupt Vector Data
Name: Incoming Interrupt Vector Data Registers (Privileged)
ASI_SDB_INTR_R (data 0): ASI==0x7F, VA==0x40
ASI_SDB_INTR_R (data 1): ASI==0x7F, VA==0x50
ASI_SDB_INTR_R (data 2): ASI==0x7F, VA==0x60
TABLE 11-7 Incoming Interrupt Vector Data Register Format
Bits  Field  Use   RW
-     Data   Data  R
Data: Interrupt data
Compatibility Note – UltraSPARC-IIi only supports the interrupt data that were present in prior UltraSPARC-based systems; that is, bits 10:0 (INR) of ASI_SDB_INTR(0). All other bits are read as 0.
Non-privileged access to this register causes a privileged_action trap.
11.10.5 Interrupt Vector Receive
Name: ASI_INTR_RECEIVE (Privileged)
ASI: 0x49, VA==0
TABLE 11-8 Interrupt Vector Receive Register Format
Bits  Field     Use                                       RW
-     Reserved  —                                         R
-     BUSY      Set when an interrupt vector is received  RW
-     MID       Always 0                                  R
BUSY: This bit is set when an interrupt vector is received.
MID: Module ID of interrupter. Always 0 on UltraSPARC-IIi.
Note – The BUSY bit must be cleared by software writing zero.
The status of an incoming interrupt can be read from ASI_INTR_RECEIVE. The BUSY bit is cleared by writing a zero to this register. Non-privileged access to this register causes a privileged_action trap.
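The interaction between the BUSY bit and the incoming data registers can be modelled as follows. This Python sketch is not a register-accurate description: the class and method names are hypothetical, and the internal queue of held-back interrupts stands in for the hardware state that remembers sources awaiting delivery.

```python
from collections import deque


class MondoReceiveModel:
    def __init__(self):
        self.busy = False
        self.data = [0, 0, 0]        # words 1 and 2 always read as 0
        self.waiting = deque()       # interrupts held back while BUSY is set

    def hardware_post(self, inr):
        """Hardware side: an interrupt number is ready for delivery."""
        self.waiting.append(inr & 0x7FF)
        self._load()

    def _load(self):
        # Load the data register only when BUSY is clear.
        if not self.busy and self.waiting:
            self.data[0] = self.waiting.popleft()
            self.busy = True

    def read_inr(self):
        """Software side: read bits 10:0 of incoming data word 0."""
        return self.data[0] & 0x7FF

    def clear_busy(self):
        """Software side: write zero to BUSY, allowing the next delivery."""
        self.busy = False
        self._load()
```

A handler built on this model would read the INR first and clear BUSY last, since clearing BUSY allows the data register to be overwritten.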
11.11 Software Interrupt (SOFTINT) Register
In order to schedule interrupt vectors for later processing, each processor can send signals to itself by setting bits in the SOFTINT Register.
TABLE 11-9 SOFTINT Register Format
Bits  Field     Use                                                          RW
-     SOFTINT   When set, bits cause interrupts at levels IRL respectively.  RW
-     TICK_INT  Timer interrupt                                              RW
SOFTINT: When set, bits cause interrupts at levels IRL respectively. TICK_INT: When the TICK_CMPR’s INT_DIS field is cleared (that is, the TICK interrupt is enabled) and the 63-bit TICK_Compare Register’s TICK_CMPR field matches the TICK Register’s counter field, the TICK_INT field is set and a software interrupt is generated. See also Section 14.1.8, “TICK Register” on page 185 and Section 14.5.1, “Per-Processor TICK Compare Field of TICK Register” on page 199. The SOFTINT register (ASR 1616) is used for communication from (TL > 0) Nucleus code to (T=0) kernel code. Non privileged accesses to this register will cause a privileged_opcode trap. Interrupt packets and other service requests can be scheduled in queues or mailboxes in memory by the nucleus, which then sets SOFTINT to cause an interrupt at level . Setting SOFTINT is done via a write to the SET_SOFTINT register (ASR 1416) with bit corresponding to the interrupt level set. Note that the value written to the SET_SOFTINT register is effectively ORd into the SOFTINT register. This action allows the interrupt handler to set one or more bits in the SOFTINT register with a single instruction. Read accesses to the SET_SOFTINT register cause an illegal_instruction trap. Non-privileged accesses to this register cause a privileged_opcode trap. When the nucleus returns, if (PSTATE.IE=1) and (PIL of the asserted bits in SOFTINT. The processor then takes a trap for the interrupt request, the nucleus sets the return state to the interrupt handler at that PIL, and returns to TL0. In this manner, the nucleus can schedule services at various priorities and process them according to their priority. When all interrupts scheduled for service at level n have been serviced, the kernel writes to the CLEAR_SOFTINT register (ASR 15 16) with bit n set, to clear that interrupt. Note that the complement of the value written to the CLEAR_SOFTINT register is effectively ANDd with the SOFTINT register. This allows the interrupt
124 UltraSPARC-IIi User’s Manual • October 1997
handler to clear one or more bits in the SOFTINT register with a single instruction. Read accesses to the CLEAR_SOFTINT register cause an illegal_instruction trap. Non privileged write accesses to this register cause a privileged_opcode trap. The timer interrupt TICK_INT is equivalent to SOFTINT and has the same effect.
Note – To avoid a race condition between the kernel clearing an interrupt and the
nucleus setting it, the kernel should reexamine the queue for any valid entries after clearing the interrupt bit.
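The set and clear semantics described above can be sketched as a small behavioral model. This is only an illustration in Python, not processor code: the function names, the 16-bit register width, and the `highest_pending` helper are assumptions made for the example, and TICK_INT is not modelled.

```python
# Behavioral sketch of the SOFTINT register and its two write-only aliases.
SOFTINT_MASK = 0xFFFF  # assumption: 16 modelled bits


def set_softint(softint, value):
    """Write to SET_SOFTINT: the written value is ORed into SOFTINT."""
    return (softint | value) & SOFTINT_MASK


def clear_softint(softint, value):
    """Write to CLEAR_SOFTINT: the complement of the value is ANDed with SOFTINT."""
    return softint & ~value & SOFTINT_MASK


def highest_pending(softint, pil, ie):
    """Return the highest asserted level above PIL, or None if none is deliverable."""
    if not ie:
        return None
    for level in range(15, pil, -1):
        if softint & (1 << level):
            return level
    return None
```

In this model, a nucleus that schedules work at levels 3 and 7 would call `set_softint` once with both bits set; the kernel would later clear each level with `clear_softint` and then recheck its queue, as the note above recommends.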
TABLE 11-10 SOFTINT ASRs

ASR Value   ASR Name/Syntax   Access   Description
14₁₆        SET_SOFTINT       W        Set bit(s) in Soft Interrupt register
15₁₆        CLEAR_SOFTINT     W        Clear bit(s) in Soft Interrupt register
16₁₆        SOFTINT_REG       RW       Per-processor Soft Interrupt register
CHAPTER
12
Instruction Set Summary
The UltraSPARC-IIi CPU implements both the standard SPARC-V9 instruction set and a number of implementation-dependent extended instructions. Standard SPARC-V9 instructions are documented in The SPARC Architecture Manual, Version 9. UltraSPARC-IIi extended instructions are documented in Chapter 13, “VIS™ and Additional Instructions.”
TABLE 12-1 lists the complete UltraSPARC-IIi instruction set. A check (✓) in the “Ext” column indicates that the instruction is an UltraSPARC-IIi extension; the absence of a check indicates a SPARC-V9 core instruction. The “Ref” column lists the section number that contains the instruction documentation. SPARC-V9 core instructions are documented in The SPARC Architecture Manual, Version 9; UltraSPARC-IIi extensions are documented in this manual.
Note – The first printing of The SPARC Architecture Manual, Version 9 contains two
sections numbered A.31; the subsequent sections in Appendix A are misnumbered. For convenience, TABLE 12-1 on page 127 of this manual follows this incorrect numbering scheme. When The SPARC Architecture Manual, Version 9 is corrected, TABLE 12-1 will be changed to match the correct numbering.
TABLE 12-1 Complete UltraSPARC-IIi Instruction Set

Opcode           Description                                                    Ext   Ref
ADD (ADDcc)      Add (and modify condition codes)                                     V9, App.A1
ADDC (ADDCcc)    Add with carry (and modify condition codes)                          V9, App.A
ALIGNADDRESS     Calculate address for misaligned data access                   ✓     13.4.5
ALIGNADDRESSL    Calculate address for misaligned data access (little-endian)   ✓     13.4.5
AND (ANDcc)      And (and modify condition codes)                                     V9, App.A
TABLE 12-1 Complete UltraSPARC-IIi Instruction Set (Continued)

Opcode             Description
ANDN (ANDNcc)      And not (and modify condition codes)
ARRAY{8,16,32}     3-D address to blocked byte address conversion
Bicc               Branch on integer condition codes
BLD                64-byte block load
BPcc               Branch on integer condition codes with prediction
BPr                Branch on contents of integer register with prediction
BST                64-byte block store
CALL               Call and link
CASA               Compare and swap word in alternate space
CASXA              Compare and swap doubleword in alternate space
DONE               Return from trap
EDGE{8,16,32}{L}   Edge boundary processing {little-endian}
FABS(s,d,q)        Floating-point absolute value
FADD(s,d,q)        Floating-point add
FALIGNDATA         Perform data alignment for misaligned data
FANDNOT1{s}        Negated src1 AND src2 (single precision)
FANDNOT2{s}        src1 AND negated src2 (single precision)
FAND{s}            Logical AND (single precision)
FBPfcc             Branch on floating-point condition codes with prediction
FBfcc              Branch on floating-point condition codes
FCMP(s,d,q)        Floating-point compare
FCMPE(s,d,q)       Floating-point compare (exception if unordered)
FCMPEQ{16,32}      Four 16-bit/two 32-bit compare; set integer dest if src1 = src2
FCMPGT{16,32}      Four 16-bit/two 32-bit compare; set integer dest if src1 > src2
FCMPLE{16,32}      Four 16-bit/two 32-bit compare; set integer dest if src1 ≤ src2
FCMPNE{16,32}
FDIV(s,d,q)
FdMULq
FEXPAND
FiTO(s,d,q)
FLUSH

FCMPGT32   Two 32-bit compare; set rd if src1 > src2
FCMPLE16   Four 16-bit compare; set rd if src1 ≤ src2
FCMPLE32   Two 32-bit compare; set rd if src1 ≤ src2
FCMPNE16   Four 16-bit compare; set rd if src1 ≠ src2
FCMPNE32   Two 32-bit compare; set rd if src1 ≠ src2
FCMPEQ16   Four 16-bit compare; set rd if src1 = src2
FCMPEQ32   Two 32-bit compare; set rd if src1 = src2
Bits:    31-30   29-25   24-19     18-14   13-5   4-0
Field:   10      rd      11 0110   rs1     opf    rs2

FIGURE 13-23 Pixel Compare Instruction Format (3)
TABLE 13-14 Pixel Compare Instruction Syntax

Suggested Assembly Language Syntax
fcmpgt16   fregrs1, fregrs2, regrd
fcmpgt32   fregrs1, fregrs2, regrd
fcmple16   fregrs1, fregrs2, regrd
fcmple32   fregrs1, fregrs2, regrd
fcmpne16   fregrs1, fregrs2, regrd
fcmpne32   fregrs1, fregrs2, regrd
fcmpeq16   fregrs1, fregrs2, regrd
fcmpeq32   fregrs1, fregrs2, regrd
Description:
Four 16-bit or two 32-bit fixed-point values in rs1 and rs2 are compared. The 4-bit or 2-bit results are stored in the corresponding least significant bits of the integer rd register. Bit zero of rd corresponds to the least significant 16-bit or 32-bit graphics compare result. For FCMPGT, each bit in the result is set if the corresponding value in rs1 is greater than the value in rs2. Less-than comparisons are made by swapping the operands. For FCMPLE, each bit in the result is set if the corresponding value in rs1 is less than or equal to the value in rs2. Greater-than-or-equal comparisons are made by swapping the operands. For FCMPEQ, each bit in the result is set if the corresponding value in rs1 is equal to the value in rs2. For FCMPNE, each bit in the result is set if the corresponding value in rs1 is not equal to the value in rs2.
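The comparison rules above can be illustrated with a short Python model. It is a sketch under stated assumptions rather than a definition of the hardware: the function names are hypothetical, the 64-bit register contents are passed as plain integers, and the lanes are treated as signed fixed-point values, which the text above does not state explicitly.

```python
import operator

# Comparison selected by the instruction family (hypothetical short names).
_TESTS = {"gt": operator.gt, "le": operator.le, "eq": operator.eq, "ne": operator.ne}


def _lanes(value, width):
    """Split a 64-bit value into signed lanes, least significant lane first."""
    mask = (1 << width) - 1
    lanes = []
    for i in range(64 // width):
        lane = (value >> (i * width)) & mask
        if lane >= 1 << (width - 1):  # assumption: lanes are signed
            lane -= 1 << width
        lanes.append(lane)
    return lanes


def pixel_compare(op, width, rs1, rs2):
    """Return the 4-bit (width=16) or 2-bit (width=32) mask written to rd."""
    result = 0
    pairs = zip(_lanes(rs1, width), _lanes(rs2, width))
    for i, (a, b) in enumerate(pairs):
        if _TESTS[op](a, b):
            result |= 1 << i  # bit 0 is the least significant lane
    return result
```

A less-than test is obtained in this model the same way as in the instruction set, by calling `pixel_compare("gt", ...)` with the two operands swapped.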
Traps:
fp_disabled
13.4.8 Edge Handling Instructions
TABLE 13-15 Edge Handling Instruction Opcodes

opcode    opf           operation
EDGE8     0 0000 0000   Eight 8-bit edge boundary processing
EDGE8L    0 0000 0010   Eight 8-bit edge boundary processing, little-endian
EDGE16    0 0000 0100   Four 16-bit edge boundary processing
EDGE16L   0 0000 0110   Four 16-bit edge boundary processing, little-endian
EDGE32    0 0000 1000   Two 32-bit edge boundary processing
EDGE32L   0 0000 1010   Two 32-bit edge boundary processing, little-endian
Bits:    31-30   29-25   24-19     18-14   13-5   4-0
Field:   10      rd      11 0110   rs1     opf    rs2

FIGURE 13-24 Edge Handling Instruction Format (3)
TABLE 13-16 Edge Handling Instruction Syntax

Suggested Assembly Language Syntax
edge8     regrs1, regrs2, regrd
edge8l    regrs1, regrs2, regrd
edge16    regrs1, regrs2, regrd
edge16l   regrs1, regrs2, regrd
edge32    regrs1, regrs2, regrd
edge32l   regrs1, regrs2, regrd
Description:
These instructions are used to handle the boundary conditions for parallel pixel scan line loops, where src1 is the address of the next pixel to render and src2 is the address of the last pixel in the scan line.
EDGE8L, EDGE16L, and EDGE32L are little-endian versions of EDGE8, EDGE16, and EDGE32. They produce an edge mask that is bit-reversed from their big-endian counterparts, but are otherwise the same. This makes the mask consistent with the mask generated by the graphics compare operations (see Section 13.4.7, “Pixel Compare Instructions” on page 159) on little-endian data.

A 2-bit (EDGE32), 4-bit (EDGE16), or 8-bit (EDGE8) pixel mask is stored in the least significant bits of rd. The mask is computed from left and right edge masks as follows:

1. The left edge mask is computed from the 3 least significant bits (LSBs) of rs1 and the right edge mask is computed from the 3 LSBs of rs2, according to TABLE 13-17 (TABLE 13-18 for little-endian byte ordering).
2. If 32-bit address masking is disabled (PSTATE.AM = 0, 64-bit addressing) and the upper 61 bits of rs1 are equal to the corresponding bits in rs2, rd is set to the right edge mask ANDed with the left edge mask.
3. If 32-bit address masking is enabled (PSTATE.AM = 1, 32-bit addressing) and bits <31:3> of rs1 are equal to the corresponding bits in rs2, rd is set to the right edge mask ANDed with the left edge mask.
4. Otherwise, rd is set to the left edge mask.

The integer condition codes are set the same as a SUBcc instruction with the same operands. End-of-scan-line comparison tests may be performed using edge with an appropriate conditional branch instruction.
Traps:
None
TABLE 13-17 Edge Mask Specification

Edge Size   A2..A0   Left Edge   Right Edge
8           000      1111 1111   1000 0000
8           001      0111 1111   1100 0000
8           010      0011 1111   1110 0000
8           011      0001 1111   1111 0000
8           100      0000 1111   1111 1000
8           101      0000 0111   1111 1100
8           110      0000 0011   1111 1110
8           111      0000 0001   1111 1111
16          00x      1111        1000
16          01x      0111        1100
16          10x      0011        1110
16          11x      0001        1111
32          0xx      11          10
32          1xx      01          11
TABLE 13-18 Edge Mask Specification (Little-Endian)

Edge Size   A2..A0   Left Edge   Right Edge
8           000      1111 1111   0000 0001
8           001      1111 1110   0000 0011
8           010      1111 1100   0000 0111
8           011      1111 1000   0000 1111
8           100      1111 0000   0001 1111
8           101      1110 0000   0011 1111
8           110      1100 0000   0111 1111
8           111      1000 0000   1111 1111
16          00x      1111        0001
16          01x      1110        0011
16          10x      1100        0111
16          11x      1000        1111
32          0xx      11          01
32          1xx      10          11
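The two edge mask tables follow a regular shift pattern, which the Python sketch below uses to model the value written to rd. The function name and parameters are hypothetical, the integer condition codes are not modelled, and the 32-bit comparison is assumed to cover address bits <31:3>.

```python
def edge(size, rs1, rs2, little_endian=False, am=False):
    """Model the rd mask produced by EDGE8/16/32 and their L variants."""
    lanes = 64 // size                      # 8, 4, or 2 mask bits
    full = (1 << lanes) - 1
    shift = {8: 0, 16: 1, 32: 2}[size]      # low address bits ignored ("x" in the tables)
    left_idx = (rs1 & 7) >> shift
    right_idx = (rs2 & 7) >> shift
    if little_endian:                       # TABLE 13-18
        left = (full << left_idx) & full
        right = full >> (lanes - 1 - right_idx)
    else:                                   # TABLE 13-17
        left = full >> left_idx
        right = (full << (lanes - 1 - right_idx)) & full
    # PSTATE.AM = 1 restricts the comparison to the low 32 address bits.
    addr_mask = 0xFFFFFFFF if am else 0xFFFFFFFFFFFFFFFF
    same_block = ((rs1 & addr_mask) >> 3) == ((rs2 & addr_mask) >> 3)
    return left & right if same_block else left
```

When the current pixel and the last pixel fall in the same 8-byte block, the two masks are combined; otherwise only the left mask applies, so a scan-line loop can use the result directly as a partial-store mask.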
13.4.9 Pixel Component Distance (PDIST)
TABLE 13-19 Pixel Component Distance Opcode

opcode   opf           operation
PDIST    0 0011 1110   distance between 8 8-bit components

Bits:    31-30   29-25   24-19     18-14   13-5   4-0
Field:   10      rd      11 0110   rs1     opf    rs2

FIGURE 13-25 Pixel Component Distance Format (3)

TABLE 13-20 Pixel Component Distance Syntax

Suggested Assembly Language Syntax
pdist   fregrs1, fregrs2, fregrd
Description:
Eight unsigned 8-bit values are contained in the 64-bit rs1 and rs2 registers. The corresponding 8-bit values in rs1 and rs2 are subtracted (i.e., rs1 – rs2). The sum of the absolute value of each difference is added to the integer in the 64-bit rd register. The result is stored in rd. Typically, this instruction is used for motion estimation in video compression algorithms.
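The accumulation described above amounts to a sum of absolute differences. A minimal Python sketch follows; the function name is hypothetical and the three registers are passed as plain 64-bit integers.

```python
def pdist(rs1, rs2, rd):
    """Add the sum of absolute byte differences of rs1 and rs2 to rd."""
    total = rd
    for i in range(8):
        a = (rs1 >> (8 * i)) & 0xFF   # unsigned 8-bit component of rs1
        b = (rs2 >> (8 * i)) & 0xFF   # matching component of rs2
        total += abs(a - b)
    return total & 0xFFFFFFFFFFFFFFFF  # result is kept in a 64-bit register
```

Because rd is both an input and the output, a motion-estimation loop can chain calls and accumulate the distance over a whole block of pixels.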
Note – For good performance, the rd operand of PDIST should not reference the result of a non-PDIST instruction in the previous two instruction groups.
Traps:
fp_disabled
13.4.10 Three-Dimensional Array Addressing Instructions
TABLE 13-21 Three-Dimensional Array Addressing Instruction Opcodes

opcode    opf           operation
ARRAY8    0 0001 0000   Convert 8-bit 3-D address to blocked byte address
ARRAY16   0 0001 0010   Convert 16-bit 3-D address to blocked byte address
ARRAY32   0 0001 0100   Convert 32-bit 3-D address to blocked byte address

Bits:    31-30   29-25   24-19     18-14   13-5   4-0
Field:   10      rd      11 0110   rs1     opf    rs2

FIGURE 13-26 Three-Dimensional Array Addressing Instruction Format (3)

TABLE 13-22 Three-Dimensional Array Addressing Instruction Syntax

Suggested Assembly Language Syntax
array8    regrs1, regrs2, regrd
array16   regrs1, regrs2, regrd
array32   regrs1, regrs2, regrd
Description:
These instructions convert three dimensional (3D) fixed-point addresses contained in rs1 to a blocked-byte address; they store the result in rd. Fixed-point addresses typically are used for address interpolation for planar reformatting operations. Blocking is performed at the 64-byte level to maximize external cache block reuse, and at the 64k-byte level to maximize TLB entry reuse, regardless of the orientation of the address interpolation. These instructions specify an element size of 8 (ARRAY8), 16 (ARRAY16) or 32 bits (ARRAY32). The rs2 operand specifies the power-of-two size of the X and Y dimensions of a 3D image array. The legal values for rs2 and their meanings are shown in TABLE 13-23. Illegal values produce undefined results in the rd register.
TABLE 13-23 Allowable values for rs2

rs2 Value   Number of Elements
0           64
1           128
2           256
3           512
4           1,024
5           2,048
FIGURE 13-27 shows the format of rs1.

Bits:    63-55       54-44        43-33       32-22        21-11       10-0
Field:   Z integer   Z fraction   Y integer   Y fraction   X integer   X fraction

FIGURE 13-27 Three Dimensional Array Fixed-Point Address Format
The integer parts of X, Y, and Z are converted to the blocked-address formats of FIGURE 13-28, FIGURE 13-29, and FIGURE 13-30, as appropriate.
Field:     Upper Z        Upper Y      Upper X   Middle Z   Middle Y   Middle X   Lower Z   Lower Y   Lower X
Low bit:   17 + 2 isrc2   17 + isrc2   17        13         9          5          4         2         0

Upper Z extends up to bit 20 + 2 isrc2.

FIGURE 13-28 Three Dimensional Array Blocked-Address Format (Array8)

Field:     Upper Z        Upper Y      Upper X   Middle Z   Middle Y   Middle X   Lower Z   Lower Y   Lower X   0
Low bit:   18 + 2 isrc2   18 + isrc2   18        14         10         6          5         3         1         0

Upper Z extends up to bit 21 + 2 isrc2.

FIGURE 13-29 Three Dimensional Array Blocked-Address Format (Array16)

Field:     Upper Z        Upper Y      Upper X   Middle Z   Middle Y   Middle X   Lower Z   Lower Y   Lower X   00
Low bit:   19 + 2 isrc2   19 + isrc2   19        15         11         7          6         4         2         0

Upper Z extends up to bit 22 + 2 isrc2.

FIGURE 13-30 Three Dimensional Array Blocked-Address Format (Array32)
The bits above Z upper are set to zero. The number of zeros in the least significant bits is determined by the element size. An element size of eight bits has no zeros, an element size of 16-bits has one zero, and an element size of 32-bits has two zeros. Bits in X and Y above the size specified by rs2 are ignored.
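The blocked-address layouts can be modelled by extracting the three integer fields from rs1 and reassembling them. The following Python sketch is illustrative only: the function name and the `elem_bits` parameter are hypothetical, and the field widths (2, 2, and 1 lower bits, 4 middle bits each, rs2 upper bits for X and Y, 4 upper bits for Z) are read from the bit positions in the figures above.

```python
def array_addr(rs1, rs2, elem_bits=8):
    """Model ARRAY8/16/32: convert a 3-D fixed-point address to a blocked byte address."""
    if not 0 <= rs2 <= 5:
        raise ValueError("rs2 outside TABLE 13-23; result undefined")
    x = (rs1 >> 11) & 0x7FF   # X integer, bits 21-11
    y = (rs1 >> 33) & 0x7FF   # Y integer, bits 43-33
    z = (rs1 >> 55) & 0x1FF   # Z integer, bits 63-55
    lower = ((z & 1) << 4) | ((y & 3) << 2) | (x & 3)
    middle = (((z >> 1) & 0xF) << 8) | (((y >> 2) & 0xF) << 4) | ((x >> 2) & 0xF)
    upper_mask = (1 << rs2) - 1  # X and Y bits above the rs2 size are ignored
    upper = ((((z >> 5) & 0xF) << (2 * rs2))
             | (((y >> 6) & upper_mask) << rs2)
             | ((x >> 6) & upper_mask))
    addr = (upper << 17) | (middle << 5) | lower
    # Larger elements shift the address left, leaving one or two zero LSBs.
    return addr << {8: 0, 16: 1, 32: 2}[elem_bits]
```

Stepping X by one element changes only the low address bits, while a step of 64 elements reaches the upper field, which is what keeps neighbouring elements in all three dimensions close together in memory.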
Note – To maximize reuse of E-cache and TLB data, software should block array
references for large images to the 64 kB level. This means processing elements within a 32 x 64 x 64 block. The following code fragment shows assembly of components along an interpolated line at the rate of one component per clock on UltraSPARC-IIi:
CODE EXAMPLE 13-4 Assembly of Components along an Interpolated Line

add          Addr, DeltaAddr, Addr
array8       Addr, %g0, bAddr
ldda         [bAddr] ASI_FL8_PRIMARY, data
faligndata   data, accum, accum
Traps:
None
13.5 Memory Access Instructions

13.5.1 Partial Store Instructions
TABLE 13-24 Partial Store Opcodes

Opcode   imm_asi         ASI Value   Operation
STDFA    ASI_PST8_P      C0₁₆        Eight 8-bit conditional stores to primary address space
STDFA    ASI_PST8_S      C1₁₆        Eight 8-bit conditional stores to secondary address space
STDFA    ASI_PST8_PL     C8₁₆        Eight 8-bit conditional stores to primary address space, little-endian
STDFA    ASI_PST8_SL     C9₁₆        Eight 8-bit conditional stores to secondary address space, little-endian
STDFA    ASI_PST16_P     C2₁₆        Four 16-bit conditional stores to primary address space
STDFA    ASI_PST16_S     C3₁₆        Four 16-bit conditional stores to secondary address space
STDFA    ASI_PST16_PL    CA₁₆        Four 16-bit conditional stores to primary address space, little-endian
STDFA    ASI_PST16_SL    CB₁₆        Four 16-bit conditional stores to secondary address space, little-endian
STDFA    ASI_PST32_P     C4₁₆        Two 32-bit conditional stores to primary address space
STDFA    ASI_PST32_S     C5₁₆        Two 32-bit conditional stores to secondary address space
STDFA    ASI_PST32_PL    CC₁₆        Two 32-bit conditional stores to primary address space, little-endian
STDFA    ASI_PST32_SL    CD₁₆        Two 32-bit conditional stores to secondary address space, little-endian
Bits:    31-30   29-25   24-19     18-14   13    12-5      4-0
Field:   11      rd      11 0111   rs1     i=0   imm_asi   rs2

FIGURE 13-31 Partial Store Format (3)
TABLE 13-25 Partial Store Syntax

Suggested Assembly Language Syntax
stda   fregrd, [regrs1] regrs2, imm_asi
Description:
The partial store instructions are selected by using one of the partial store ASIs with the STDA instruction. Two 32-bit, four 16-bit or eight 8-bit values from the 64-bit rd register are conditionally stored at the address specified by rs1 using the mask specified by rs2. The value in rs2 has the same format as the result generated by the pixel compare instructions (see Section 13.4.7, “Pixel Compare Instructions” on page 159). The most significant bit of the mask (not the entire register) corresponds to the most significant part of the rs1 register. The data is stored in little-endian form in memory if the ASI name has a “_LITTLE” suffix; otherwise, it is big-endian.
Note – If the byte ordering is little-endian, the byte enables generated by this
instruction are swapped with respect to big-endian.
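The masked store can be pictured with the following Python sketch, which writes only the lanes whose mask bit is set. It is a model under assumptions, not a statement of the hardware behavior: the function name is hypothetical, memory is a `bytearray`, 8-byte alignment is assumed to be required, and the little-endian case is assumed to lay the data out as a byte-swapped 64-bit store would.

```python
def partial_store(memory, addr, data, mask, width=8, little_endian=False):
    """Conditionally store 8-, 16-, or 32-bit lanes of a 64-bit value."""
    if addr % 8:
        raise ValueError("mem_address_not_aligned")
    lanes = 64 // width
    lane_bytes = width // 8
    # Byte image that a full 64-bit store would write at addr.
    image = data.to_bytes(8, "little" if little_endian else "big")
    for lane in range(lanes):            # lane 0 is the least significant lane
        if mask & (1 << lane):
            # Big-endian: the most significant lane lands at the lowest address.
            pos = lane if little_endian else lanes - 1 - lane
            start = pos * lane_bytes
            memory[addr + start:addr + start + lane_bytes] = image[start:start + lane_bytes]
```

In this model the mask produced by a pixel compare or edge instruction can be passed straight through as `mask`, which mirrors how the instructions are meant to be combined.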
Traps:
fp_disabled
mem_address_not_aligned
data_access_exception
PA_watchpoint
VA_watchpoint
illegal_instruction (when i = 1, no immediate mode is supported. This is not checked if there is a data_access_exception for a non-STDFA opcode).
13.5.2 Short Floating-Point Load and Store Instructions
TABLE 13-26 Short Floating-Point Load and Store Instruction

Opcode        imm_asi       ASI Value   Operation
LDDFA STDFA   ASI_FL8_P     D0₁₆        8-bit load/store from/to primary address space
LDDFA STDFA   ASI_FL8_S     D1₁₆        8-bit load/store from/to secondary address space
LDDFA STDFA   ASI_FL8_PL    D8₁₆        8-bit load/store from/to primary address space, little-endian
LDDFA STDFA   ASI_FL8_SL    D9₁₆        8-bit load/store from/to secondary address space, little-endian
LDDFA STDFA   ASI_FL16_P    D2₁₆        16-bit load/store from/to primary address space
LDDFA STDFA   ASI_FL16_S    D3₁₆        16-bit load/store from/to secondary address space
LDDFA STDFA   ASI_FL16_PL   DA₁₆        16-bit load/store from/to primary address space, little-endian
LDDFA STDFA   ASI_FL16_SL   DB₁₆        16-bit load/store from/to secondary address space, little-endian
Bits:    31-30   29-25   24-19     18-14   13    12-5      4-0
Field:   11      rd      11 0011   rs1     i=0   imm_asi   rs2

Bits:    31-30   29-25   24-19     18-14   13    12-0
Field:   11      rd      11 0011   rs1     i=1   simm_13

TABLE 13-27 Format (3) LDDFA

Bits:    31-30   29-25   24-19     18-14   13    12-5      4-0
Field:   11      rd      11 0111   rs1     i=0   imm_asi   rs2

Bits:    31-30   29-25   24-19     18-14   13    12-0
Field:   11      rd      11 0111   rs1     i=1   simm_13

TABLE 13-28 Format (3) STDFA
TABLE 13-29 Short Floating-Point Load and Store Instruction Syntax

Suggested Assembly Language Syntax
ldda   [reg_addr] imm_asi, fregrd
ldda   [reg_plus_imm] %asi, fregrd
stda   fregrd, [reg_addr] imm_asi
stda   fregrd, [reg_plus_imm] %asi
Description:
Short floating-point load and store instructions are selected by using one of the short ASIs with the LDDA and STDA instructions. These ASIs allow 8- and 16-bit loads or stores to be performed to the floating-point registers. Eight-bit loads can be performed to arbitrary byte addresses. For 16-bit loads, the least significant bit of the address must be zero, or a mem_address_not_aligned trap is taken. Short loads are zero-extended to the full floating-point register. Short stores access the low-order 8 or 16 bits of the register. Little-endian ASIs transfer data in little-endian format in memory; otherwise, memory is assumed to be big-endian. Short loads and stores typically are used with the FALIGNDATA instruction (see Section 13.4.5, “Alignment Instructions” on page 154) to assemble or store 64 bits of non-contiguous components.
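The zero-extension and low-order-bit behavior can be sketched in Python as follows. The function names are hypothetical, memory is a `bytearray`, and the floating-point register is represented as a 64-bit integer; the alignment check on 16-bit stores is an assumption carried over from the rule stated for loads.

```python
def short_load(memory, addr, width=8, little_endian=False):
    """Load 8 or 16 bits and zero-extend to a 64-bit register value."""
    nbytes = width // 8
    if addr % nbytes:
        raise ValueError("mem_address_not_aligned")
    order = "little" if little_endian else "big"
    return int.from_bytes(memory[addr:addr + nbytes], order)


def short_store(memory, addr, freg, width=8, little_endian=False):
    """Store the low-order 8 or 16 bits of a 64-bit register value."""
    nbytes = width // 8
    if addr % nbytes:
        raise ValueError("mem_address_not_aligned")
    order = "little" if little_endian else "big"
    low_bits = freg & ((1 << width) - 1)
    memory[addr:addr + nbytes] = low_bits.to_bytes(nbytes, order)
```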
Traps:
fp_disabled
PA_watchpoint
VA_watchpoint
mem_address_not_aligned (Checked for opcode implied alignment if the opcode is not LDFA or STDFA)
13.5.3 Block Load and Store Instructions
TABLE 13-30 Block Load and Store Instruction Opcodes

Opcode        imm_asi            ASI Value   Operation
LDDFA STDFA   ASI_BLK_AIUP       70₁₆        64-byte block load/store from/to primary address space, user privilege
LDDFA STDFA   ASI_BLK_AIUS       71₁₆        64-byte block load/store from/to secondary address space, user privilege
LDDFA STDFA   ASI_BLK_AIUPL      78₁₆        64-byte block load/store from/to primary address space, user privilege, little-endian
LDDFA STDFA   ASI_BLK_AIUSL      79₁₆        64-byte block load/store from/to secondary address space, user privilege, little-endian
LDDFA STDFA   ASI_BLK_P          F0₁₆        64-byte block load/store from/to primary address space
LDDFA STDFA   ASI_BLK_S          F1₁₆        64-byte block load/store from/to secondary address space
LDDFA STDFA   ASI_BLK_PL         F8₁₆        64-byte block load/store from/to primary address space, little-endian
LDDFA STDFA   ASI_BLK_SL         F9₁₆        64-byte block load/store from/to secondary address space, little-endian
STDFA         ASI_BLK_COMMIT_P   E0₁₆        64-byte block commit store to primary address space
STDFA         ASI_BLK_COMMIT_S   E1₁₆        64-byte block commit store to secondary address space
Bits:    31-30   29-25   24-19     18-14   13    12-5      4-0
Field:   11      rd      11 0011   rs1     i=0   imm_asi   rs2

Bits:    31-30   29-25   24-19     18-14   13    12-0
Field:   11      rd      11 0011   rs1     i=1   simm_13

FIGURE 13-32 Format (3) LDDFA

Bits:    31-30   29-25   24-19     18-14   13    12-5      4-0
Field:   11      rd      11 0111   rs1     i=0   imm_asi   rs2

Bits:    31-30   29-25   24-19     18-14   13    12-0
Field:   11      rd      11 0111   rs1     i=1   simm_13

FIGURE 13-33 Format (3) STDFA
TABLE 13-31 Block Load and Store Instruction Syntax

Suggested Assembly Language Syntax
ldda   [reg_addr] imm_asi, fregrd
ldda   [reg_plus_imm] %asi, fregrd
stda   fregrd, [reg_addr] imm_asi
stda   fregrd, [reg_plus_imm] %asi
Description:
Block load and store instructions are selected by using one of the block transfer ASIs with the LDDA and STDA instructions. These ASIs allow block loads or stores to be performed to the same address spaces as normal loads and stores. Little-endian ASIs access data in little-endian format, otherwise the access is assumed to be big-endian. The byte swapping is performed separately for each of the eight double-precision registers used by the instruction. Endianness does not matter if these instructions are being used for block copy. Block stores with commit force the data to be written to memory and invalidate copies in all caches, if present. As a result, block commit stores maintain coherency with the I-cache unlike other stores. They do not, however, flush instructions that have already been fetched into the pipeline. Execute a FLUSH, DONE, or RETRY instruction to flush the pipeline before executing the modified code. LDDA with a block transfer ASI loads 64 bytes of data from a 64-byte aligned memory area into eight double-precision floating-point registers specified by fregrd. The lowest addressed eight bytes in memory are loaded into the lowest numbered double-precision rd register. An illegal_instruction trap is taken if the floating-point registers are not aligned on an eight-double-precision register boundary. The least significant 6 bits of the address must be zero or a mem_address_not_aligned trap is taken.
STDA with a block transfer ASI stores data from eight double-precision floating-point registers specified by fregrd to a 64-byte aligned memory area. The lowest addressed eight bytes in memory are stored from the lowest numbered double-precision freg. An illegal_instruction trap is taken if the floating-point registers are not aligned on an eight-register boundary. The least significant 6 bits of the address must be zero, or a mem_address_not_aligned trap is taken.
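The register and address alignment rules for block transfers can be summarized with a small Python model. It is a sketch only: the function names are hypothetical, the register file is a dictionary keyed by even register number (%f0, %f2, and so on), only the big-endian case is shown, and none of the ordering or interlock behavior described in this section is modelled.

```python
def _check_block(addr, rd):
    # Eight double-precision registers start at %f0, %f16, %f32, or %f48.
    if rd not in (0, 16, 32, 48):
        raise ValueError("illegal_instruction")
    if addr % 64:
        raise ValueError("mem_address_not_aligned")


def block_load(memory, addr, fregs, rd):
    """Load 64 bytes into eight double-precision registers starting at rd."""
    _check_block(addr, rd)
    for i in range(8):
        chunk = memory[addr + 8 * i:addr + 8 * i + 8]
        fregs[rd + 2 * i] = int.from_bytes(chunk, "big")  # lowest address -> lowest register


def block_store(memory, addr, fregs, rd):
    """Store eight double-precision registers starting at rd to 64 bytes of memory."""
    _check_block(addr, rd)
    for i in range(8):
        memory[addr + 8 * i:addr + 8 * i + 8] = fregs[rd + 2 * i].to_bytes(8, "big")
```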
Traps:
fp_disabled
illegal_instruction (nonaligned rd. Not checked if opcode is not LDFA or STDFA)
data_access_exception
mem_address_not_aligned (Checked for opcode implied alignment if the opcode is not LDFA or STDFA)
PA_watchpoint
VA_watchpoint
Note – These instructions are used for transferring large blocks of data (more than 256 bytes); for example, BCOPY and BFILL. On UltraSPARC-IIi they do not allocate in the D-cache or E-cache on a miss. UltraSPARC-IIi updates the E-cache on a hit. UltraSPARC-IIi allows one BLD and two BSTs to be outstanding on the interconnect at one time.

To simplify the implementation, BLD destination registers may or may not interlock like ordinary load instructions. Before referencing the block load data, a second BLD (to a different set of registers) or a MEMBAR #Sync must be performed. If a second BLD is used to synchronize with returning data, then UltraSPARC-IIi continues execution before all data has been returned. The lowest number register being loaded may be referenced in the first instruction group following the second BLD, the second lowest number register may be referenced in the second group, and so on. If this rule is violated, data from before or after the load may be returned. When making this count of instruction groups, only groups containing floating-point instructions should be considered.

Similarly, BST source data registers are not interlocked against completion of previous load instructions (even if a second BLD has been performed). The previous load data must be referenced by some other intervening instruction, or an intervening MEMBAR #Sync must be performed. If the programmer violates these rules, data from before or after the load may be used.

UltraSPARC-IIi continues execution before all of the store data has been transferred. If store data registers are overwritten before the next block store or MEMBAR #Sync instruction, then the following rule must be observed. The first register can be overwritten in the same instruction group as the BST, the second register can be overwritten in the instruction group following the block store, and so on. If this rule is violated, the store may store correct data or the overwritten data.

There must be a MEMBAR #Sync or a trap following a BST before executing a DONE, RETRY, or WRPR to PSTATE instruction. If this rule is violated, instructions after the DONE, RETRY, or WRPR to PSTATE may not see the effects of the updated PSTATE.

BLD does not follow memory model ordering with respect to stores. In particular, read-after-write and write-after-read hazards to overlapping addresses are not detected. The side effects bit associated with the access is ignored (see Section 15.2, “Translation Table Entry (TTE)” on page 205). If ordering with respect to earlier stores is important (for example, a block load that overlaps previous stores), then there must be an intervening MEMBAR #StoreLoad or stronger MEMBAR. If ordering with respect to later stores is important (for example, a block load that overlaps a subsequent store), then there must be an intervening MEMBAR #LoadStore or reference to the block load data. This restriction does not apply when a trap is taken, so the trap handler need not consider pending block loads. If the BLD overlaps a previous or later store and there is no intervening MEMBAR, trap, or data reference, the BLD may return data from before or after the store.
Compatibility Note – Prior UltraSPARCs may have provided the first two registers at the same time. If code depends upon this unsupported behavior it must be modified for UltraSPARC-IIi.
BST does not follow memory model ordering with respect to loads, stores or flushes. In particular, read-after-write, write-after-write, flush after write and write-after-read hazards to overlapping addresses are not detected. The side effects bit associated with the access is ignored. If ordering with respect to earlier or later loads or stores is important then there must be an intervening reference to the load data (for earlier loads), or appropriate MEMBAR instruction. This restriction does not apply when a trap is taken, so the trap handler does not have to worry about pending block stores. If the BST overlaps a previous load and there is no intervening load data reference or MEMBAR #LoadStore instruction, the load may return data from before or after the store and the contents of the block are undefined. If the BST overlaps a later load and there is no intervening trap or MEMBAR #StoreLoad instruction, the contents of the block are undefined. If the BST overlaps a later store or flush and there is no intervening trap or MEMBAR #StoreStore instruction, the contents of the block are undefined. Block load and store operations do not obey the ordering restrictions of the currently selected processor memory model (TSO, PSO, or RMO); block operations always execute under an RMO memory ordering model. Explicit MEMBAR instructions are required to order block operations among themselves or with respect to normal
loads and stores. In addition, block operations do not conform to dependence order on the issuing processor; that is, no read-after-write or write-after-read checking occurs between block loads and stores. Explicit MEMBARs are required to enforce dependence ordering between block operations that reference the same address.

Typically, BLD and BST are used in loops where software can ensure that there is no overlap between the data being loaded and the data being stored. The loop is preceded and followed by the appropriate MEMBARs to ensure that there are no hazards with loads and stores outside the loops. CODE EXAMPLE 13-5 on page 177 illustrates the inner loop of a byte-aligned block copy operation. Note that the loop must be unrolled twice to achieve maximum performance. All FP registers are double-precision. Eight versions of this loop are needed to handle all the cases of doubleword misalignment between the source and destination.
CODE EXAMPLE 13-5 Byte-Aligned Block Copy Inner Loop

loop:
        faligndata   %f0, %f2, %f34
        faligndata   %f2, %f4, %f36
        faligndata   %f4, %f6, %f38
        faligndata   %f6, %f8, %f40
        faligndata   %f8, %f10, %f42
        faligndata   %f10, %f12, %f44
        faligndata   %f12, %f14, %f46
        addcc        %l0, -1, %l0             ! decrement block count
        bg,pt        l1
        fmovd        %f14, %f48               ! delay slot: carry last doubleword over
        (end of loop handling)
l1:
        ldda         [regaddr] ASI_BLK_P, %f0
        stda         %f32, [regaddr] ASI_BLK_P
        faligndata   %f48, %f16, %f32
        faligndata   %f16, %f18, %f34
        faligndata   %f18, %f20, %f36
        faligndata   %f20, %f22, %f38
        faligndata   %f22, %f24, %f40
        faligndata   %f24, %f26, %f42
        faligndata   %f26, %f28, %f44
        faligndata   %f28, %f30, %f46
        addcc        %l0, -1, %l0             ! decrement block count
        be,pnt       done
        fmovd        %f30, %f48               ! delay slot: carry last doubleword over
        ldda         [regaddr] ASI_BLK_P, %f16
        stda         %f32, [regaddr] ASI_BLK_P
        ba           loop
        faligndata   %f48, %f0, %f32          ! delay slot
done:
        (end of loop processing)
13.6 Additional Instructions

13.6.1 Atomic Quad Load
TABLE 13-32 Atomic Quad Load Opcodes

Opcode   imm_asi                  ASI Value   Operation
LDDA     ASI_NUCLEUS_QUAD_LDD     24₁₆        128-bit atomic load
LDDA     ASI_NUCLEUS_QUAD_LDD_L   2C₁₆        128-bit atomic load, little endian

Bits:    31-30   29-25   24-19     18-14   13    12-5      4-0
Field:   11      rd      01 0011   rs1     i=0   imm_asi   rs2

Bits:    31-30   29-25   24-19     18-14   13    12-0
Field:   11      rd      01 0011   rs1     i=1   simm_13

FIGURE 13-34 Format (3) LDDA

TABLE 13-33 Atomic Quad Load Syntax

Suggested Assembly Language Syntax
ldda   [reg_addr] imm_asi, regrd
ldda   [reg_plus_imm] %asi, regrd
Description:
These ASIs are used with the LDDA instruction to atomically read a 128-bit data item. They are intended to be used by the TLB miss handler to access TSB entries without requiring locks. The data is placed in an even/odd pair of 64-bit integer registers. The lowest address 64-bits is placed in the even register; the highest address 64-bits is placed in the odd register. The reference is made from the nucleus context. In addition to the usual traps for LDDA using a privileged ASI, a data_access_exception trap is taken for a noncacheable access, or use with any instruction other than LDDA. A mem_address_not_aligned trap is taken if the access is not aligned on a 128-bit boundary.
Traps:
fp_disabled
PA_watchpoint
VA_watchpoint
mem_address_not_aligned (Checked for opcode implied alignment if the opcode is not LDFA or STDFA)
data_access_exception
13.6.2 SHUTDOWN
TABLE 13-34 SHUTDOWN Opcode

opcode     opf           operation
SHUTDOWN   0 1000 0000   Shutdown to enter power down mode

Bits:    31-30   29-25   24-19     18-14   13-5   4-0
Field:   10      -       11 0110   -       opf    -

FIGURE 13-35 SHUTDOWN Instruction Format (3)

TABLE 13-35 SHUTDOWN Syntax

Suggested Assembly Language Syntax
shutdown
Description:
The EPA Energy Star specification requires a system standby power consumption of less than 30 W (excluding the system monitor).
To enter SHUTDOWN mode, UltraSPARC-IIi software saves everything to disk and the power supply is turned off. A timer turns the power back on after 30 minutes. UltraSPARC-IIi does not support the full feature set of some earlier PCI-based UltraSPARC systems, principally to avoid the circuit complexity of maintaining memory refresh while the processor is shut down.

Invoking the SHUTDOWN instruction causes all processor, cache, and memory state to be lost. A power-on reset (POR) must be used to restart the processor. A status bit indicates the reason for the POR.

This instruction stops all internal clocks, achieving the lowest possible power consumption while the power supply is on. To leave the system and external cache interface in a clean state, the SHUTDOWN instruction waits for all outstanding transactions to be completed before sending a shutdown signal to the internal clock generator. The internal clock generator asserts the internal reset for 19 clocks to force the chip into a safe state, and then stops the internal clock and the PLL. The internal clock is left in the high state. All external signals should be left in the normal reset state. An external power-down signal (EPD) is activated by the clock generator at the same time as the internal reset. This signal is used to put the E-cache RAMs in standby mode.

This is a privileged instruction; an attempt to execute it while in non-privileged mode causes a privileged_opcode trap.
Compatibility Note – When the processor is reset, UPA64S, PCI, and APB are also
reset.
Traps:
privileged_opcode
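As an illustration of the encoding given in TABLE 13-34 and FIGURE 13-35, the following Python sketch assembles the 32-bit SHUTDOWN instruction word from its fields. It assumes that the reserved fields are written as zero; the helper names are hypothetical and do not belong to any assembler or toolchain.

```python
# Field values taken from TABLE 13-34 and FIGURE 13-35.
OP = 0b10                    # bits 31:30
OP3 = 0b110110               # bits 24:19
OPF_SHUTDOWN = 0b010000000   # bits 13:5


def encode_shutdown():
    """Assemble the SHUTDOWN instruction word (assumption: reserved fields are zero)."""
    return (OP << 30) | (OP3 << 19) | (OPF_SHUTDOWN << 5)


def is_shutdown(word):
    """Check the op, op3 and opf fields of a 32-bit word; reserved fields are ignored."""
    return (
        ((word >> 30) & 0x3) == OP
        and ((word >> 19) & 0x3F) == OP3
        and ((word >> 5) & 0x1FF) == OPF_SHUTDOWN
    )
```

With reserved fields zero, the word comes out as 81B0 1000 in hexadecimal.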
UltraSPARC-IIi User’s Manual • October 1997
CHAPTER
14
Implementation Dependencies
14.1
SPARC-V9 General Information
14.1.1
Level-2 Compliance (Impdep #1)
UltraSPARC-IIi is designed to meet Level-2 SPARC-V9 compliance. It
• Correctly interprets all non-privileged operations, and
• Correctly interprets all privileged elements of the architecture.
Note – System emulation routines (for example, quad-precision floating-point
operations) shipped with UltraSPARC-IIi also must be Level-2 compliant.
14.1.2
Unimplemented Opcodes, ASIs, and ILLTRAP
SPARC-V9 unimplemented, reserved, ILLTRAP opcodes, and instructions with invalid values in reserved fields (other than reserved FPops or fields in graphics instructions that reference floating-point registers and the reserved field in the Tcc instruction) encountered during execution cause an illegal_instruction trap. The reserved field in the Tcc instruction is not checked because SPARC-V8 did not reserve this field. Reserved FPops and invalid values in reserved fields in graphics instructions that reference floating-point registers cause an fp_exception_other (with FSR.ftt=unimplemented_FPop) trap. Unimplemented and reserved ASI values cause a data_access_exception trap.
14.1.3
Trap Levels (Impdep #37, 38, 39, 40, 114, 115)
UltraSPARC-IIi supports five trap levels; that is, MAXTL=5. Normal execution is at TL0. Traps at MAXTL –1 cause the CPU to enter RED_state. If a trap is generated while the CPU is operating at TL = MAXTL, the CPU will enter error_state and generate a Watchdog Reset (WDR). CWP updates for window traps that cause entry to error_state are the same as when error_state is not entered. A processor normally executes at trap level 0 (execute_state, TL0). The trap handling mechanism in SPARC-V9 differs from SPARC-V8 when a trap or error condition is encountered at TL0. In SPARC-V8, the CPU enters trap state and system (privileged) software must save enough processor state to guarantee that any error condition detected while in the trap handler will not put the CPU into error_state (that is, cause a reset). Then the trap routine is entered to process the erroneous condition. Upon completion of trap processing, the state of the CPU is restored before returning to the offending code or terminating the process. This time-consuming operation is necessary because SPARC-V8 does not support nested traps. In SPARC-V9, a trap makes the CPU enter the next higher trap level, which is a very fast and efficient process because there is one set of trap state registers for each trap level. After saving the most important machine states (PC, next PC, PSTATE) on the trap stack at this level, the trap (or error) condition is processed. For a complete description of traps and RED_state handling, see Section 17.4, “Machine State after Reset and in RED_state” on page 272.
Note – The RED_state trap vector address (RSTVaddr) is 256 MB below the top of the virtual address space; that is, at virtual address FFFF FFFF F000 0000₁₆, which is passed through to physical address 1FF F000 0000₁₆ in RED_state. UltraSPARC-IIi has a second RSTV available; see “RED_state Trap Vector” on page 271.
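The trap-level rules of this section can be summarized in a small Python model. This is only a sketch: the state label "trap_handler" and the function name are hypothetical, and the 41-bit physical address width used to derive the RSTV physical address is an assumption inferred from the address quoted in the note.

```python
from typing import Tuple

MAXTL = 5  # from the text: UltraSPARC-IIi supports five trap levels


def take_trap(tl: int) -> Tuple[int, str]:
    """Return (new TL, resulting state) for a trap taken at trap level tl."""
    if not 0 <= tl <= MAXTL:
        raise ValueError("TL out of range")
    if tl == MAXTL:
        # No trap level is left: error_state, followed by a Watchdog Reset.
        # Modelling choice: TL is reported unchanged.
        return tl, "error_state"
    if tl == MAXTL - 1:
        return tl + 1, "RED_state"
    return tl + 1, "trap_handler"


# RSTVaddr lies 256 MB below the top of the 64-bit virtual address space.
RSTV_VA = (1 << 64) - (256 << 20)
# Assumption: "passed through" keeps the low 41 bits as the physical address.
RSTV_PA = RSTV_VA & ((1 << 41) - 1)
```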
14.1.4
Alternate RSTV support
UltraSPARC-IIi has a pin to select a second RSTV to allow use of PC compatible SuperIO chips on a PCI bus. See Section 17.2.7.3, “Reset_Control Register (0x1FE.0000.F020)” on page 267 and Section 17.3.2, “RED_state Trap Vector” on page 271.
14.1.5
Trap Handling (Impdep #16, 32, 33, 35, 36, 44)
UltraSPARC-IIi supports precise trap handling for all operations except for deferred or disrupting traps from hardware failures encountered during memory accesses. These failures are discussed in Section 16.2, “Deferred Errors” on page 240 and Section 16.3, “Disrupting Errors” on page 242. UltraSPARC-IIi implements precise traps, interrupts, and exceptions for all instructions, including long-latency floating-point operations. Five trap levels are supported, which allows graceful recovery from faults. The trap levels are shown in FIGURE 14-1. UltraSPARC-IIi can efficiently execute kernel code even in the event of multiple nested traps, promoting processor efficiency while reducing the system overhead needed for trap handling. Three sets of alternate globals are selected for different kinds of traps:
• MMU globals for memory faults,
• Interrupt globals, and
• Alternate globals for all other exceptions.
This further increases OS performance, providing fast trap execution by avoiding the need to save and restore registers while processing exceptions.
Level 0: Normal Program Execution
Level 1: System Calls, Interrupt Handlers, Emulation
Level 2: Exceptions in Common OS Routines
Level 3: Page Fault Handlers
Level 4: RED_state Handler
FIGURE 14-1
Nested Trap Levels
All traps supported in UltraSPARC-IIi are listed in TABLE 6-12 on page 56.
14.1.6
SIGM Support (Impdep #116)
UltraSPARC-IIi initiates a Software-Initiated Reset (SIR) by executing a SIGM instruction while in privileged mode. When in non-privileged mode, SIGM behaves as a NOP. See also Section 17.2.3, “Watchdog Reset (WDR) and error_state” on page 263.
14.1.7
44-bit Virtual Address Space
UltraSPARC-IIi supports a 44-bit subset of the full 64-bit virtual address space. Although the full 64 bits are generated and stored in integer registers, legal addresses are restricted to two equal halves at the extreme lower and upper portions of the full virtual address space. Virtual addresses between 0000 0800 0000 0000₁₆ and FFFF F7FF FFFF FFFF₁₆ inclusive lie within a “VA Hole,” are termed “out-of-range,” and are illegal. Prior UltraSPARC implementations introduced the additional restriction on software to not use pages within 4 GB of the VA hole as instruction pages to avoid problems with prefetching into the VA hole. UltraSPARC-IIi assumes that this convention is followed for similar reasons. Note that there are no trap mechanisms to detect a violation of this convention. Address translation and MMU-related descriptions can be found in Section 4.2, “Virtual Address Translation” on page 23.
FFFF FFFF FFFF FFFF
See Note (1)
FFFF F801 0000 0000 FFFF F800 0000 0000 FFFF F7FF FFFF FFFF
Out of Range VA (VA “Hole”)
0000 0800 0000 0000 0000 07FF FFFF FFFF 0000 07FE FFFF FFFF
See Note (1)
0000 0000 0000 0000
Note (1): Prior implementations restricted use of this region to data only.
FIGURE 14-2
UltraSPARC-IIi’s 44-bit Virtual Address Space, with Hole (Same as FIGURE 4-2 on page 25)
Note – Throughout this document, when virtual address fields are specified as 64-bit quantities, they are assumed to be sign-extended from bit 43 of the VA.
A number of state registers are affected by the reduced virtual address space. TBA, TPC, TNPC, VA and PA watchpoint, and DMMU SFAR registers are 44 bits, sign-extended to 64 bits on read accesses. No checks are done when these registers are written by software. It is the responsibility of privileged software to properly update these registers.
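The range rule and the sign extension described above can be expressed as a short Python sketch. The function names are hypothetical; the check relies only on the stated 44-bit width, under which an address is in range exactly when bits 63 through 43 are all zero or all one.

```python
VA_BITS = 44
MASK64 = (1 << 64) - 1


def va_in_range(va):
    """True if the 64-bit va lies outside the VA hole."""
    top = (va & MASK64) >> (VA_BITS - 1)      # bits 63:43, 21 bits
    return top == 0 or top == (1 << 21) - 1


def sign_extend_va(value):
    """Sign-extend a 44-bit register value to 64 bits, as done on register reads."""
    low = value & ((1 << VA_BITS) - 1)
    if low >> (VA_BITS - 1):                  # bit 43 set: fill bits 63:44 with ones
        low |= MASK64 & ~((1 << VA_BITS) - 1)
    return low
```

The first and last addresses of the hole, 0000 0800 0000 0000 and FFFF F7FF FFFF FFFF, are both rejected by va_in_range, while their immediate neighbours outside the hole are accepted.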
An out-of-range address during an instruction access causes an instruction_access_exception trap if PSTATE.AM is not set. If the target address of a JMPL or RETURN instruction is an out-of-range address and PSTATE.AM is not set, a trap is generated with the PC = the address of the JMPL or RETURN instruction and the trap type in the I-MMU SFSR register. This instruction_access_exception trap is lower priority than other traps on the JMPL or RETURN (illegal_instruction due to nonzero reserved fields in the JMPL or RETURN, mem_address_not_aligned trap, or window_fill trap), because it really applies to the target. The trap handler can determine the out-of-range address by decoding the JMPL instruction from the code.
All other control transfer instructions trap on the PC of the target instruction along with different status in the I-MMU SFSR register. Because the PC is sign-extended to 64 bits, the trap handler must adjust the PC value to compute the faulting address by XORing ones into the upper 20 bits. See also Section 15.9.4, “I-/D-MMU Synchronous Fault Status Registers (SFSR)” on page 223.
When a trap occurs on the delay slot of a taken branch or call whose target is out-of-range, or the last instruction below the VA hole, UltraSPARC-IIi records the fact that nPC points to an out-of-range instruction. If the trap handler executes a DONE or RETRY without saving nPC, the instruction_access_exception trap is taken when the instruction at nPC is executed. If nPC is saved and subsequently restored by the trap handler, the fact that nPC points to an out-of-range instruction is lost. To guarantee that all out-of-range instruction accesses cause traps, software should not map addresses within 2^31 bytes of either side of the VA hole as executable.
An out-of-range address during a data access results in a data_access_exception trap if PSTATE.AM is not set.
Because the D-MMU SFAR contains only 44 bits, the trap handler must decode the load or store instruction if the full 64-bit virtual address is needed. See also Section 15.9.4, “I-/D-MMU Synchronous Fault Status Registers (SFSR)” on page 223 and Section 15.9.5, “I-/D-MMU Synchronous Fault Address Registers (SFAR)” on page 225.
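The PC adjustment mentioned above (XORing ones into the upper 20 bits) might look as follows in a trap handler model. This Python sketch uses a hypothetical function name. It recovers the original address only when the upper 20 bits of the target were the complement of bit 43, which is the case for targets just inside either edge of the VA hole.

```python
MASK64 = (1 << 64) - 1
UPPER_20_BITS = 0xFFFFF00000000000   # bits 63:44


def faulting_address_from_pc(trap_pc):
    """Undo the sign extension of a recorded PC by inverting bits 63:44."""
    return (trap_pc ^ UPPER_20_BITS) & MASK64
```

For example, a target of 0000 0800 0000 1000 is recorded as FFFF F800 0000 1000 after sign extension from bit 43, and the XOR restores the original value.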
14.1.8
TICK Register
UltraSPARC-IIi implements a 63-bit TICK counter. For the state of this register at reset, see TABLE 17-3 on page 272.
TABLE 14-1 TICK Register Format
Bits    Field      Use                                RW
63      NPT        Non-privileged Trap enable         RW
62:0    counter    Elapsed CPU clock cycle counter    RW
NPT: Non-privileged Trap enable. If set, an attempt by non-privileged software to read the TICK register causes a privileged_action trap. If clear, non-privileged software can read this register with the RDTICK instruction. This register can only be written by privileged software. A write attempt by non-privileged software causes a privileged_action trap.
counter: 63-bit elapsed CPU clock cycle counter.
Note – TICK.NPT is set and TICK.counter is cleared after both a Power-On-Reset
(POR) and an Externally Initiated Reset (XIR).
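A minimal Python sketch of the TICK layout and the NPT read rule follows. It assumes the SPARC-V9 bit assignment (NPT in bit 63, counter in bits 62:0); the function names are hypothetical.

```python
NPT_SHIFT = 63
COUNTER_MASK = (1 << 63) - 1

# State after POR or XIR according to the note: NPT set, counter cleared.
TICK_AFTER_RESET = 1 << NPT_SHIFT


def tick_fields(tick):
    """Split a 64-bit TICK value into (NPT, counter)."""
    return (tick >> NPT_SHIFT) & 1, tick & COUNTER_MASK


def tick_read_trap(tick, privileged):
    """Return the trap raised by a read of TICK, or None if the read succeeds."""
    npt, _ = tick_fields(tick)
    if npt and not privileged:
        return "privileged_action"
    return None
```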
14.1.9
Population Count Instruction (POPC)
The population count instruction is emulated in software rather than being executed in hardware.
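As an illustration of what such an emulation routine computes, the following Python sketch counts the one bits of a 64-bit operand. It is not the routine shipped with any system; it uses the common clear-lowest-set-bit loop, and the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def popc(value):
    """Count the one bits in the low 64 bits of value."""
    x = value & MASK64
    count = 0
    while x:
        x &= x - 1      # clears the lowest set bit
        count += 1
    return count
```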
14.1.10
Secure Software
To establish an enhanced security environment, it may be necessary to initialize certain processor states between contexts. Examples of such states are the contents of integer and floating-point register files, condition codes, and state registers. See also Section 14.2.2, “Clean Window Handling (Impdep #102).”
14.1.11
Address Masking (Impdep #125)
When PSTATE.AM=1, the CALL, JMPL, and RDPC instructions and all traps transmit zero in the high-order 32 bits of the PC to their specified destination registers.
14.2
SPARC-V9 Integer Operations
14.2.1
Integer Register File and Window Control Registers (Impdep #2)
UltraSPARC-IIi implements an eight window 64-bit integer register file; that is, NWINDOWS = 8. UltraSPARC-IIi truncates values stored in the CWP, CANSAVE, CANRESTORE, CLEANWIN, and OTHERWIN registers to three bits. This includes implicit updates to these registers by SAVE(D) and RESTORE(D) instructions. The upper two bits of these registers read as zero.
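The three-bit truncation can be modelled as follows. This Python sketch assumes the SPARC-V9 convention that SAVE increments CWP and RESTORE decrements it; the function names are hypothetical.

```python
NWINDOWS = 8
WINDOW_MASK = NWINDOWS - 1   # three bits


def write_window_register(value):
    """Value actually stored in CWP, CANSAVE, CANRESTORE, CLEANWIN or OTHERWIN."""
    return value & WINDOW_MASK


def cwp_after_save(cwp):
    """CWP after a SAVE (assumption: SAVE increments CWP, as in SPARC-V9)."""
    return (cwp + 1) & WINDOW_MASK


def cwp_after_restore(cwp):
    """CWP after a RESTORE (assumption: RESTORE decrements CWP)."""
    return (cwp - 1) & WINDOW_MASK
```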
14.2.2
Clean Window Handling (Impdep #102)
SPARC-V9 introduced the concept of “clean window” to enhance security and integrity during program execution. A clean window is defined to be a register window that contains either all zeroes or addresses and data that belong to the current context. The CLEANWIN register records the number of available clean windows. When a SAVE instruction requests a window, and there are no more clean windows, a clean_window trap is generated. System software must then initialize all registers in the next available window, or windows, to zero before returning to the requesting context.
14.2.3
Integer Multiply and Divide
Integer multiplications (MULScc, SMUL{cc}, MULX) and divisions (SDIV{cc}, UDIV{cc}, UDIVX) are executed directly in hardware. Multiplications are done 2 bits at a time with early exit when the final result is generated. Divisions use a 1-bit non-restoring division algorithm.
Note – For best performance, the smaller of the two operands of a multiply should
be the rs1 operand.
14.2.4
Version Register (Impdep #2, 13, 101, 104)
Consult the product data sheet for the content of the Version Register for an implementation. For the state of this register after resets, see TABLE 17-3 on page 272.
TABLE 14-2 Version Register Format
Bits     Field       Use                                                   RW
63:48    manuf       Manufacturer identification                           R
47:32    impl        Implementation identification                         R
31:24    mask        Mask set version                                      R
23:16    Reserved    -                                                     R
15:8     maxtl       Maximum trap level supported                          R
7:5      Reserved    -                                                     R
4:0      maxwin      Maximum number of windows of integer register file    R
manuf: 16-bit manufacturer code, 0017₁₆ (TI JEDEC number), that identifies the manufacturer of an UltraSPARC-IIi CPU.
impl: 16-bit implementation code, 0010₁₆, that uniquely identifies an UltraSPARC-IIi-class CPU. TABLE 14-3 shows the VER.impl values for each UltraSPARC-IIi model.
TABLE 14-3 VER.impl Values by UltraSPARC-IIi Model
            UltraSPARC-I    UltraSPARC-II
VER.impl    0010₁₆          0011₁₆
mask: 8-bit mask set revision number that identifies the mask set revision of this UltraSPARC-IIi. This is subdivided into a 4-bit major mask number and a 4-bit minor mask number. The major number starts at zero and is incremented for each all-layer mask revision. The minor number starts at zero for each major revision, and is incremented for each less-than-all-layer mask revision.
maxtl: Maximum number of supported trap levels beyond level 0; the same as the largest possible value for the TL register. For UltraSPARC-IIi, maxtl = 5.
maxwin: Maximum index number available for use as a valid CWP value. The value is NWINDOWS–1; for UltraSPARC-IIi, maxwin = 7.
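A Version Register value could be decoded along the following lines. This Python sketch assumes the SPARC-V9 field positions (manuf 63:48, impl 47:32, mask 31:24, maxtl 15:8, maxwin 4:0) and further assumes that the major mask number occupies the upper four bits of the mask field; the function name is hypothetical.

```python
def decode_ver(ver):
    """Split a 64-bit VER value into its named fields."""
    mask = (ver >> 24) & 0xFF
    return {
        "manuf": (ver >> 48) & 0xFFFF,
        "impl": (ver >> 32) & 0xFFFF,
        "mask_major": mask >> 4,     # assumption: major number in the upper nibble
        "mask_minor": mask & 0xF,
        "maxtl": (ver >> 8) & 0xFF,
        "maxwin": ver & 0x1F,
    }
```

For a CPU with maxwin = 7, the number of register windows is maxwin + 1 = 8, matching NWINDOWS.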
14.3
SPARC-V9 Floating-Point Operations
14.3.1
Subnormal Operands & Results; Non-standard Operation
UltraSPARC-IIi handles some cases of subnormal operands or results directly in hardware and traps on the rest. In the trapping cases, an fp_exception_other (with FSR.ftt=2, unfinished_FPop) trap is signalled and these operations are handled in system software. The unfinished trapping cases are listed in TABLE 14-4 and TABLE 14-5. Because trapping on subnormal operands and results can be costly, UltraSPARC-IIi supports the non-standard result option of the SPARC-V9 architecture. If FSR.NS = 1, subnormal operands or results encountered in trapping cases are flushed to zero and the unfinished_FPop floating-point trap type is not taken.
14.3.1.1
Subnormal Operands
If FSR.NS=1, the subnormal operands of these operations are replaced by zeroes with the same sign. An inexact exception is signalled in this case, which causes an fp_exception_ieee_754 trap if enabled by FSR.TEM. If FSR.NS=0, subnormal operands generate traps according to TABLE 14-4 on page 189. ER is the biased exponent of the result before rounding.
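The operand replacement performed when FSR.NS=1 can be illustrated for double-precision values with the Python sketch below. It models only the flush to a zero of the same sign, not the inexact exception or any trap behavior; the function name is hypothetical.

```python
import struct


def flush_subnormal_operand(x):
    """Return x, or a zero with the same sign if x is a subnormal double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0 and fraction != 0:
        bits &= 1 << 63          # keep only the sign bit
    return struct.unpack(">d", struct.pack(">Q", bits))[0]
```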
Subnormal Operand Trapping Cases (NS=0)
One Subnormal Operand Two Subnormal Operands
TABLE 14-4
Operations
F(sd)TO(ix) F(sd)TO(ds) FSQRT(sd) FADD/SUB(sd) FSMULD FMUL(sd) FDIV(sd)
— Unfinished trap always Unfinished trap always Unfinished trap always
Unfinished trap always Unfinished trap if no overflow and: -25
Field       Use                                                  RW
Reserved    -                                                    R
fcc3        Floating-point condition code (set 3)                RW
fcc2        Floating-point condition code (set 2)                RW
fcc1        Floating-point condition code (set 1)                RW
RD          Rounding direction                                   RW
u           Unused                                               R
TEM         IEEE-754 trap enable mask                            RW
NS          Non-standard floating-point results                  R
Reserved    -                                                    R
ver         FPU version number                                   R
ftt         Floating-point trap type                             RW
qne         Floating-point deferred-trap queue (FQ) not empty    RW
u           Unused                                               R
fcc0        Floating-point condition code (set 0)                RW
aexc        Accumulated outstanding exceptions                   RW
cexc        Current outstanding exceptions                       RW
u: Unused field, read as 0.
Note – The LD{X}FSR instruction should write zeroes to the u fields; undefined values (read as 0) of these fields are stored by the ST{X}FSR instruction.
fcc3, fcc2, fcc1, fcc0: Four sets of 2-bit floating-point condition codes, which are modified by the FCMP{E} (and LD{X}FSR) instructions. The FBfcc, FMOVcc, and MOVcc instructions use one of these condition code sets to determine conditional control transfers and conditional register moves.
Note – fcc0 is the same as the fcc in SPARC-V8.
RD: IEEE Std. 754-1985 Rounding Direction.
TABLE 14-8 Floating-Point Rounding Modes
RD    Round Toward
0     Nearest (even if tie)
1     0
2     +∞
3     –∞
TEM: 5-bit trap enable mask for the IEEE-754 floating-point exceptions. If a floating-point operate instruction produces one or more exceptions, the corresponding cexc/aexc bits are set and an fp_exception_ieee_754 (with FSR.ftt=1, IEEE_754_exception) exception is generated.
NS: When this field = 0, UltraSPARC-IIi produces IEEE-754 compatible results. In particular, subnormal operands or results may cause a trap. When this field = 1, UltraSPARC-IIi may deliver a non-IEEE-754 compatible result. In particular, subnormal operands and results may be flushed to zero. See TABLE 14-4 and TABLE 14-5 on page 190.
ver: This field identifies a particular implementation of the UltraSPARC-IIi FPU architecture.
ftt: The 3-bit floating-point trap type field is set whenever a floating-point instruction causes the fp_exception_ieee_754 or fp_exception_other traps.
TABLE 14-9 Floating-Point Trap Type Values
ftt    Floating-Point Trap Type    Trap Signalled
0      None                        -
1      IEEE_754_exception          fp_exception_ieee_754
2      unfinished_FPop             fp_exception_other
3      unimplemented_FPop          fp_exception_other
4      sequence_error              fp_exception_other
5      hardware_error              -
6      invalid_fp_register         -
7      reserved                    -
Note – UltraSPARC-IIi neither detects nor generates the hardware_error or
invalid_fp_register trap types directly in hardware.
Note – UltraSPARC-IIi does not contain an FQ. An attempt to read the FQ with a
RDPR instruction causes an illegal_instruction trap.
Note – SPARC-V8-compatible programs should set the least significant bit of the floating-point register number to zero for all double-precision instructions. Violation of this SPARC-V8 architectural constraint may result in unexpected program behavior.
qne: This bit is not used, because UltraSPARC-IIi implements precise floating-point exceptions.
aexc: 5-bit accrued exception field accumulates IEEE 754 exceptions while floating-point exception traps are disabled (that is, FSR.TEM=0).
cexc: 5-bit current exception field indicates the most recently generated IEEE 754 exceptions.
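The FSR fields described in this section could be extracted as in the following Python sketch. The bit positions are an assumption taken from the SPARC-V9 FSR layout (RD 31:30, TEM 27:23, NS 22, ftt 16:14, fcc0 11:10, aexc 9:5, cexc 4:0); the names of the rounding modes and trap types come from TABLE 14-8 and TABLE 14-9. The function name is hypothetical.

```python
ROUND_TOWARD = ("nearest", "zero", "+infinity", "-infinity")      # TABLE 14-8
FTT_NAMES = (                                                     # TABLE 14-9
    "None", "IEEE_754_exception", "unfinished_FPop", "unimplemented_FPop",
    "sequence_error", "hardware_error", "invalid_fp_register", "reserved",
)


def decode_fsr(fsr):
    """Extract selected FSR fields (assumed SPARC-V9 bit positions)."""
    return {
        "rd": ROUND_TOWARD[(fsr >> 30) & 0x3],
        "tem": (fsr >> 23) & 0x1F,
        "ns": (fsr >> 22) & 0x1,
        "ftt": FTT_NAMES[(fsr >> 14) & 0x7],
        "fcc0": (fsr >> 10) & 0x3,
        "aexc": (fsr >> 5) & 0x1F,
        "cexc": fsr & 0x1F,
    }
```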
14.4
SPARC-V9 Memory-Related Operations
14.4.1
Load/Store Alternate Address Space (Impdep #5, 29, 30)
Supported ASI accesses are listed in Section 6.3, “Alternate Address Spaces” on page 39.
14.4.2
Load/Store ASR (Impdep #6,7,8,9, 47, 48)
Supported ASRs are listed in Section 6.5, “Ancillary State Registers” on page 52.
14.4.3
MMU Implementation (Impdep #41)
UltraSPARC-IIi memory management is based on software-managed instruction and data Translation Lookaside Buffers (TLBs) and in-memory Translation Storage Buffers (TSBs) backed by a Software Translation Table. See Chapter 4, “Overview of I and D-MMUs,” for more details.
14.4.4
FLUSH and Self-Modifying Code (Impdep #122)
FLUSH is needed to synchronize code and data spaces after code space is modified during program execution. FLUSH is described in Section 8.3.2, “Memory Synchronization: MEMBAR and FLUSH” on page 72. On UltraSPARC-IIi, the FLUSH effective address is translated by the D-MMU. As a result, FLUSH can cause a data_access_exception (the page is mapped with the side-effect or no-fault-only bits set, the virtual address is out of range, or there is a privilege violation) or a data_access_MMU_miss trap. For a data_access_exception, the trap handler can decode the FLUSH instruction and perform a DONE to be consistent with the normal SPARC-V9 behavior of no traps on FLUSH. For a data_access_MMU_miss, the trap handler should do the normal TLB miss processing and perform a RETRY if the page can be mapped in the TLB; otherwise it should perform a DONE.
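The handler policy described above reduces to a small decision function. The Python sketch below is a model only: the function name and the page_can_be_mapped parameter are hypothetical, and the real handlers operate on trap state rather than strings.

```python
def flush_trap_action(trap, page_can_be_mapped=False):
    """Return "DONE" or "RETRY" for a trap raised by a FLUSH instruction."""
    if trap == "data_access_exception":
        # Skip the FLUSH, matching the SPARC-V9 behavior of no traps on FLUSH.
        return "DONE"
    if trap == "data_access_MMU_miss":
        # After normal TLB miss processing, re-execute the FLUSH only if the
        # page could be mapped in the TLB.
        return "RETRY" if page_can_be_mapped else "DONE"
    raise ValueError("trap not covered by this sketch: " + trap)
```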
Note – SPARC-V9 specifies that the FLUSH instruction has no latency on the issuing processor. In other words, a store to instruction space prior to the FLUSH instruction is visible immediately after the completion of FLUSH. MEMBAR #StoreStore is required to ensure proper ordering in a multiprocessing system when the memory model is not TSO. When a MEMBAR #StoreStore, FLUSH sequence is performed, UltraSPARC-IIi guarantees that earlier code modifications will be visible across the whole system.
14.4.5
PREFETCH{A} (Impdep #103, 117)
For UltraSPARC-I, PREFETCH{A} instructions with fcn=0..4 are treated as NOPs. For UltraSPARC-II, PREFETCH{A} instructions with fcn=0..4 have the following meanings:
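TABLE 14-10 PREFETCH{A} Variants (UltraSPARC-II)
fcn    Prefetch Function             Action
0      Prefetch for several reads    Generate P_RDS_REQ if desired line is not present in E-cache
1      Prefetch for one read         Generate P_RDS_REQ if desired line is not present in E-cache
4      Prefetch page                 Generate P_RDS_REQ if desired line is not present in E-cache
2      Prefetch for several writes   Generate P_RDO_REQ if desired line is not present in E-cache in either E or M state
3      Prefetch for one write        Generate P_RDO_REQ if desired line is not present in E-cache in either E or M state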
PREFETCH{A} instructions with fcn=5..15 cause an illegal_instruction trap. PREFETCH{A} instructions with fcn=16..31 are treated as NOPs.
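The three fcn ranges can be captured in a short classifier. This Python sketch uses hypothetical names and result labels and does not model the bus transactions themselves.

```python
def classify_prefetch_fcn(fcn):
    """Classify the 5-bit fcn field of a PREFETCH{A} instruction."""
    if not 0 <= fcn <= 31:
        raise ValueError("fcn is a 5-bit field")
    if fcn <= 4:
        return "prefetch"              # implemented variants 0..4
    if fcn <= 15:
        return "illegal_instruction"   # fcn 5..15 trap
    return "nop"                       # fcn 16..31 are treated as NOPs
```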
14.4.6
Non-faulting Load and MMU Disable (Impdep #117)
When the data MMU is disabled, accesses are assumed to be non-cacheable (TTE.CP=0) and with side-effect (TTE.E=1). Non-faulting loads encountered when the MMU is disabled cause a data_access_exception trap with SFSR.FT=2 (speculative load to page with side-effect attribute).
14.4.7
LDD/STD Handling (Impdep #107, 108)
LDD and STD instructions are directly executed in hardware.
Note – LDD/STD are deprecated in SPARC-V9. In UltraSPARC-IIi it is more
efficient to use LDX/STX for accessing 64-bit data. LDD/STD take longer to execute than two 32-/64-bit loads/stores.
14.4.8
FP mem_address_not_aligned (Impdep #109, 110, 111, 112)
LDDF{A}/STDF{A} cause an LDDF_mem_address_not_aligned or STDF_mem_address_not_aligned trap, respectively, if the effective address is 32-bit aligned but not 64-bit (doubleword) aligned. LDQF{A}/STQF{A} are not directly executed in hardware; they cause an illegal_instruction trap.
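The alignment cases for LDDF and STDF can be sketched as follows in Python. The case of an address that is not even 32-bit aligned is an assumption based on the general SPARC-V9 mem_address_not_aligned trap; the function name is hypothetical.

```python
def fp_double_alignment_trap(op, effective_address):
    """Return the alignment trap for an LDDF or STDF access, or None if aligned."""
    if op not in ("LDDF", "STDF"):
        raise ValueError("op must be LDDF or STDF")
    if effective_address % 8 == 0:
        return None
    if effective_address % 4 == 0:
        # 32-bit aligned but not doubleword aligned.
        return op + "_mem_address_not_aligned"
    # Assumption: any other misalignment raises the generic SPARC-V9 trap.
    return "mem_address_not_aligned"
```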
14.4.9
Supported Memory Models (Impdep #113, 121)
UltraSPARC-IIi supports all three memory models (TSO, PSO, RMO). See Section 20.2, “Supported Memory Models” on page 336.
14.4.10
I/O Operations (Impdep #118, 123)
I/O spaces and their accesses are specified in Section 8.3.7, “I/O (PCI or UPA64S) and Accesses with Side-effects” on page 78.
14.5
Non-SPARC-V9 Extensions
14.5.1
Per-Processor TICK Compare Field of TICK Register
The SPARC-V9 TICK register is used for fine-grain measurements of time in processor cycles. The TICK Compare field (TICK_CMPR) of the TICK Register provides added functionality for thread scheduling on a per-processor basis. Non-privileged accesses to this register will cause a privileged_opcode trap. See TABLE 17-3 on page 272 for a list of reset states.
TABLE 14-11 TICK_compare Register Format
Bits    Field        Use                                  RW
63      INT_DIS      TICK_INT interrupt enable            RW
62:0    TICK_CMPR    Compare value for TICK interrupts    RW
INT_DIS: If set, TICK_INT interrupt generation is disabled.
TICK_CMPR: Writes to the TICK_Compare Register load a value for comparison to the TICK register bits 62:0. When these values match and INT_DIS=0, a TICK_INT is posted in the SOFTINT register. This has the effect of posting a level-14 interrupt to the processor when permitted by the processor interrupt level (PSTATE.PIL). This function is independent on each processor.
14.5.2
Cache Sub-system
UltraSPARC-IIi contains one or more levels of cache. The cache sub-system architecture is described in Chapter 3, “Cache Organization.”
14.5.3
Memory Management Unit
UltraSPARC-IIi implements a multi-level memory management scheme. The MMU architecture is described in Chapter 4, “Overview of I and D-MMUs.”
14.5.4
Error Handling
UltraSPARC-IIi implements a set of programmer-visible error and exception registers. These registers and their usage are described in Chapter 16, “Error Handling.”
14.5.5
Block Memory Operations
UltraSPARC-IIi supports 64-byte block memory operations utilizing a block of eight double-precision floating point registers as a temporary buffer. See Section 13.5.3, “Block Load and Store Instructions” on page 172.
14.5.6
Partial Stores
UltraSPARC-IIi supports 8-/16-/32-bit partial stores to memory. See Section 13.5.1, “Partial Store Instructions” on page 168.
14.5.7
Short Floating-Point Loads and Stores
UltraSPARC-IIi supports 8-/16-bit loads and stores to the floating-point registers. See Section 13.5.2, “Short Floating-Point Load and Store Instructions” on page 170.
14.5.8
Atomic Quad-load
UltraSPARC-IIi supports 128-bit atomic load operations to a pair of integer registers. See Section 13.6.1, “Atomic Quad Load” on page 178.
14.5.9
PSTATE Extensions: Trap Globals
UltraSPARC-IIi supports two additional sets of eight 64-bit global registers: interrupt globals and MMU globals. These additional registers are called the “trap globals.” Two 1-bit fields, PSTATE.IG and PSTATE.MG, have been added to the PSTATE register to select which set of global registers to use. The PSTATE.IG and PSTATE.MG bits are also stored with the rest of the PSTATE register in the TSTATE register when a trap is taken. See Chapter 11, “Interrupt Handling” for a description of the trap global registers. See TABLE 17-3 on page 272 for the states of these bits on reset.
TABLE 14-12 Extended PSTATE Register
Bits    Field    Use                               RW
11      IG       Interrupt globals enable          RW
10      MG       MMU globals enable                RW
9       CLE      Current little endian enable      RW
8       TLE      Trap little endian enable         RW
7:6     MM       Memory Model                      RW
5       RED      RED_state enable                  RW
4       PEF      Floating point enable             RW
3       AM       32-bit address mask enable        RW
2       PRIV     Privileged mode                   RW
1       IE       Interrupt enable                  RW
0       AG       Alternate global enable           RW
Note – Exiting RED_state by writing 0 to PSTATE.RED in the delay slot of a JMPL instruction is not recommended. A noncacheable instruction prefetch may be made to the JMPL target, which may be in a cacheable memory area. This may result in a bus error on some systems, which causes an instruction_access_error trap. The trap can be masked by setting the NCEEN bit in the ESTATE_ERR_EN register to zero, but this will mask all non-correctable error checking. Exiting RED_state with DONE or RETRY avoids this problem.
UltraSPARC-IIi provides Interrupt and MMU global register sets in addition to the two global register sets specified by SPARC-V9. The currently active set of global registers is specified by the AG, IG and MG bits according to TABLE 14-13 on page 202.
Note – The IG and MG fields are saved on the trap stack along with the rest of the
PSTATE register.
TABLE 14-13 PSTATE Global Register Selection Encoding
AG    IG    MG    Globals in Use
0     0     0     Normal
0     0     1     MMU
0     1     0     Interrupt
0     1     1     Reserved
1     0     0     Alternate
1     0     1     Reserved
1     1     0     Reserved
1     1     1     Reserved
When an interrupt_vector trap (trap type=60₁₆) is taken, UltraSPARC-IIi selects the Interrupt Global registers by setting IG and clearing AG and MG. When a fast_instruction_access_MMU_miss, fast_data_access_MMU_miss, fast_data_access_protection, data_access_exception, or instruction_access_exception trap is taken, UltraSPARC-IIi selects the MMU Global Registers by setting MG and clearing AG and IG. When any other type of trap occurs, UltraSPARC-IIi selects the Alternate Global Registers by setting AG and clearing IG and MG. Note that global register selection is the same for traps that enter RED_state. Executing a DONE or RETRY instruction restores the {AG, IG, MG} state that was in effect before the trap was taken. These three bits can also be set or cleared by writing to the PSTATE register with a WRPR instruction.
Note – The AG, IG, and MG bits are mutually exclusive. Attempting to set a
reserved encoding using a WRPR to PSTATE generates an illegal_instruction trap. UltraSPARC-IIi does not check for a reserved encoding in TSTATE. This causes undefined results when a DONE or RETRY is executed.
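The selection rules of TABLE 14-13 and the trap-time behavior described above can be modelled as follows. This Python sketch uses hypothetical function names; trap names are passed as strings for illustration only.

```python
# (AG, IG, MG) encodings from TABLE 14-13; every other combination is reserved.
GLOBALS_IN_USE = {
    (0, 0, 0): "Normal",
    (0, 0, 1): "MMU",
    (0, 1, 0): "Interrupt",
    (1, 0, 0): "Alternate",
}

MMU_TRAPS = {
    "fast_instruction_access_MMU_miss",
    "fast_data_access_MMU_miss",
    "fast_data_access_protection",
    "data_access_exception",
    "instruction_access_exception",
}


def globals_in_use(ag, ig, mg):
    """Name of the global register set selected by the AG, IG and MG bits."""
    return GLOBALS_IN_USE.get((ag, ig, mg), "Reserved")


def select_on_trap(trap_name):
    """(AG, IG, MG) values set by hardware when the named trap is taken."""
    if trap_name == "interrupt_vector":
        return (0, 1, 0)
    if trap_name in MMU_TRAPS:
        return (0, 0, 1)
    return (1, 0, 0)
```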
14.5.10
Interrupt Vector Handling
Processors and I/O devices can interrupt a selected processor by assembling and sending an interrupt packet consisting of three 64-bit interrupt data words. This allows hardware interrupts and cross calls to have the same hardware mechanism and to share a common software interface for processing. Interrupt vectors are described in Chapter 11, “Interrupt Handling.”
14.5.11
Power Down Support and the SHUTDOWN Instruction
UltraSPARC-IIi supports power down mode to reduce power requirements during idle periods. A privileged instruction, SHUTDOWN, has been added to facilitate a software-controlled power down of the CPU and system. Power down support and the SHUTDOWN instruction are described in Section 13.6.2, “SHUTDOWN” on page 179.
14.5.12
UltraSPARC-IIi Instruction Set Extensions (Impdep #106)
The UltraSPARC-IIi CPU extends the standard SPARC-V9 instruction set with three new classes of instructions. These are designed to support power down mode (see Section 13.6.2, “SHUTDOWN” on page 179), enhance graphics functionality (see Section 13.4, “Graphics Instructions”), and improve the efficiency of memory accesses (see Section 13.5, “Memory Access Instructions”). Unimplemented IMPDEP1 and IMPDEP2 opcodes encountered during execution cause an illegal_instruction trap.
14.5.13
Performance Instrumentation
UltraSPARC-IIi performance instrumentation is described in Section B.4, “Performance Instrumentation Counter Events” on page 403.
14.5.14
Debug and Diagnostics Support
UltraSPARC-IIi support for debug and diagnostics is described in Appendix A, “Debug and Diagnostics Support.”
CHAPTER
15
MMU Internal Architecture
15.1
Introduction
This chapter provides detailed information about the UltraSPARC-IIi Memory Management Unit. It describes the internal architecture of the MMU and how to program it.
15.2
Translation Table Entry (TTE)
The Translation Table Entry, illustrated in FIGURE 15-1, is the UltraSPARC-IIi equivalent of a SPARC-V8 page table entry; it holds information for a single page mapping. The TTE is broken into two 64-bit words, representing the tag and data of the translation. Just as in a hardware cache, the tag is used to determine whether there is a hit in the TSB. If there is a hit, the data is fetched by software.
Tag:
Bits     63    62:61    60:48      47:42    41:0
Field    G     -        Context    -        VA_tag
Data:
Bits     63    62:61    60     59    58:50    49:41    40:13    12:7    6    5     4     3    2    1    0
Field    V     Size     NFO    IE    Soft2    Diag     PA       Soft    L    CP    CV    E    P    W    G
FIGURE 15-1 Translation Table Entry (TTE) (from TSB)
G: Global. If the Global bit is set, the Context field of the TTE is ignored during hit detection. This allows any page to be shared among all (user or supervisor) contexts running in the same processor. The Global bit is duplicated in the TTE tag and data to optimize the software miss handler.
Context: The 13-bit context identifier associated with the TTE.
VA_tag: Virtual Address Tag. The virtual page number. Bits 21 through 13 are not maintained in the tag, since these bits are used to index the smallest direct-mapped TSB of 64 entries.
Note – Software must sign-extend the upper bits of VA_tag (VA bits 63 through 44) to form an in-range VA.
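The tag comparison that a software miss handler performs against a TSB entry can be sketched in Python as follows. The field positions are read from FIGURE 15-1 (G in bit 63, Context in bits 60:48, VA_tag in bits 41:0 holding VA bits 63:22); the function names are hypothetical, and applying the Global bit in this software comparison is a modelling choice that mirrors the hit-detection rule stated for the G field.

```python
MASK64 = (1 << 64) - 1
CONTEXT_MASK = (1 << 13) - 1
VA_TAG_MASK = (1 << 42) - 1


def make_tte_tag(va, context, is_global=False):
    """Build a TTE tag word from a virtual address and a context number."""
    va_tag = (va & MASK64) >> 22                 # VA bits 63:22
    return (int(is_global) << 63) | ((context & CONTEXT_MASK) << 48) | va_tag


def tte_tag_matches(tag, va, context):
    """True if the tag describes the page containing va in the given context."""
    if (tag & VA_TAG_MASK) != ((va & MASK64) >> 22):
        return False
    if tag >> 63:                                # Global: context is ignored
        return True
    return ((tag >> 48) & CONTEXT_MASK) == (context & CONTEXT_MASK)
```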
V: Valid. If the Valid bit is set, the remaining fields of the TTE are meaningful. Note that the explicit Valid bit is redundant with the software convention of encoding an invalid TTE with an unused context. The encoding of the context field is necessary to cause a failure in the TTE tag comparison, while the explicit Valid bit in the TTE data simplifies the TLB miss handler.
Size: The page size of this entry, encoded as shown in the following table.
Size Field Encoding (from TTE)
Page Size
TABLE 15-1 Size
00 01 10 11
8 kB 64 kB 512 kB 4 MB
NFO: No-Fault-Only. If this bit is set, loads with ASI_PRIMARY_NO_FAULT{_LITTLE} or ASI_SECONDARY_NO_FAULT{_LITTLE} are translated. Any other access will trap with a data_access_exception trap (FT=10₁₆). The NFO-bit in the I-MMU is read as zero and ignored when written. If this bit is set before loading the TTE into the TLB, the iTLB miss handler should generate an error.

IE: Invert Endianness. If this bit is set, accesses to the associated page are processed with inverse endianness from what is specified by the instruction (big-for-little and little-for-big). See Section 15.6, “ASI Value, Context, and Endianness Selection for Translation” on page 216 for details. In the I-MMU this bit is read as zero and ignored when written.

Note – This bit is intended to be set primarily for noncacheable accesses. The performance of cacheable accesses will be degraded as if the access had missed the D-cache.

Soft, Soft2: Software-defined fields, provided for use by the operating system. The Soft and Soft2 fields may be written with any value; they read as zero.

Diag: Used by diagnostics to access the redundant information held in the TLB structure. The field holds the Used bit, the RAM size bits, and the CAM size bits. (Size bits are 3-bit encoded as 000=8K, 001=64K, 011=512K, 111=4M.) The size bits are read-only; the Used bit is read/write. All other Diag bits are reserved.

PA: The physical page number. Page offset bits for larger page sizes (PA<15:13>, PA<18:13>, and PA<21:13> for 64 kB, 512 kB, and 4 MB pages, respectively) are stored in the TLB and returned for a Data Access read, but ignored during normal translation.

L: Lock. If this bit is set, the TTE entry will be “locked down” when it is loaded into the TLB; that is, if this entry is valid, it will not be replaced by the automatic replacement algorithm invoked by an ASI store to the Data In register. The lock bit has no meaning for an invalid entry. Arbitrary entries may be locked down in the TLB. Software must ensure that at least one entry is not locked when replacing a TLB entry; otherwise the last TLB entry will be replaced.

CP, CV: The cacheable-in-physically-indexed-cache and cacheable-in-virtually-indexed-cache bits determine the placement of data in UltraSPARC-IIi caches, according to TABLE 15-2. The MMU does not operate on the cacheable bits, but merely passes them through to the cache subsystem. The CV-bit in the I-MMU is read as zero and ignored when written.
TABLE 15-2 Cacheable Field Encoding (from TSB)

                     Meaning of TTE When Placed in:
Cacheable {CP, CV}   iTLB (I-cache PA-Indexed)      dTLB (D-cache VA-Indexed)
0x                   Non-cacheable                  Non-cacheable
10                   Cacheable E-cache, I-cache     Cacheable E-cache only
11                   Cacheable E-cache, I-cache     Cacheable E-cache, D-cache
E: Side-effect. If this bit is set, speculative loads and FLUSHes will trap for addresses within the page, noncacheable memory accesses other than block loads and stores are strongly ordered against other E-bit accesses, and noncacheable stores are not merged. This bit should be set for pages that map I/O devices having side-effects. Note, however, that the E-bit does not prevent normal instruction prefetching. The E-bit in the I-MMU is read as zero and ignored when written.

Note – The E-bit does not force an uncacheable access. It is expected, but not required, that the CP and CV bits will be set to zero when the E-bit is set.

P: Privileged. If the P bit is set, only the supervisor can access the page mapped by the TTE. If the P bit is set and an access to the page is attempted when PSTATE.PRIV=0, the MMU will signal an instruction_access_exception or data_access_exception trap (FT=1₁₆).

W: Writable. If the W bit is set, the page mapped by this TTE has write permission granted. Otherwise, write permission is not granted and the MMU will cause a data_access_protection trap if a write is attempted. The W-bit in the I-MMU is read as zero and ignored when written.

G: Global. This bit must be identical to the Global bit in the TTE tag. Similar to the case of the Valid bit, the Global bit in the TTE tag is necessary for the TSB hit comparison, while the Global bit in the TTE data facilitates the loading of a TLB entry.

Compatibility Note – Referenced and Modified bits are maintained by software. The Global, Privileged, and Writable fields replace the 3-bit ACC field of the SPARC-V8 Reference MMU Page Translation Entry.
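As an illustration of the field layout described in this section, the following Python sketch unpacks a 64-bit TTE data word into its fields. The bit positions are read from FIGURE 15-1 and the page sizes from TABLE 15-1; the helper names bits and decode_tte_data are hypothetical and are not part of any UltraSPARC-IIi software interface.

```python
# Page size in bytes for each Size encoding (TABLE 15-1).
PAGE_BYTES = {0b00: 8 * 1024, 0b01: 64 * 1024, 0b10: 512 * 1024, 0b11: 4 * 1024 * 1024}


def bits(word, hi, lo):
    """Return word<hi:lo> as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_tte_data(data):
    """Unpack a 64-bit TTE data word (hypothetical helper for illustration)."""
    return {
        "V": bits(data, 63, 63),
        "page_bytes": PAGE_BYTES[bits(data, 62, 61)],
        "NFO": bits(data, 60, 60),
        "IE": bits(data, 59, 59),
        "PA": bits(data, 40, 13) << 13,  # page base address taken from PA<40:13>
        "L": bits(data, 6, 6),
        "CP": bits(data, 5, 5),
        "CV": bits(data, 4, 4),
        "E": bits(data, 3, 3),
        "P": bits(data, 2, 2),
        "W": bits(data, 1, 1),
        "G": bits(data, 0, 0),
    }


# Example: a valid, writable, physically cacheable 64 kB page.
example = (1 << 63) | (0b01 << 61) | (0x12340 << 13) | (1 << 5) | (1 << 1)
fields = decode_tte_data(example)
```

The Soft, Soft2, and Diag fields are omitted from the sketch because they carry no translation attributes.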
15.3
Translation Storage Buffer (TSB)
The TSB is an array of TTEs managed entirely by software. It serves as a cache of the Software Translation Table, used to quickly reload the TLB in the event of a TLB miss. The discussion in this section assumes the use of the hardware support for TSB access described in Section 15.3.1, “Hardware Support for TSB Access” on page 209, although the operating system is not required to make use of this support hardware.

Inclusion of the TLB entries in the TSB is not required; that is, translation information may exist in the TLB that is not present in the TSB.

The TSB is arranged as a direct-mapped cache of TTEs. The UltraSPARC-IIi MMU provides precomputed pointers into the TSB for the 8 kB and 64 kB page TTEs. In each case, N least significant bits of the respective virtual page number are used as the offset from the TSB base address, with N equal to log base 2 of the number of TTEs in the TSB.

A bit in the TSB register allows the TSB 64 kB pointer to be computed for the case of common or split 8 kB/64 kB TSB(s).

No hardware TSB indexing support is provided for the 512 kB and 4 MB page TTEs. Since the TSB is entirely software managed, however, the operating system may choose to place these larger page TTEs in the TSB by forming the appropriate pointers. In addition, simple modifications to the 8 kB and 64 kB index pointers provided by the hardware allow formation of an M-way set-associative TSB, multiple TSBs per page size, and multiple TSBs per process.

The TSB exists as a normal data structure in memory, and therefore may be cached. Indeed, the speed of the TLB miss handler relies on the TSB accesses hitting the level-2 cache at a substantial rate. This policy may result in some conflicts with normal instruction and data accesses, but the dynamic sharing of the level-2 cache resource should provide a better overall solution than that provided by a fixed partitioning.

FIGURE 15-2 shows both the common and split TSB organization. The constant N is determined by the Size field in the TSB register; it may range from 512 to 64K entries.
FIGURE 15-2 TSB Organization

[Figure: each TSB line holds an 8-byte tag followed by an 8-byte data word (Tag1 at offset 0000₁₆, Data1 at offset 0008₁₆). A common TSB has N lines, Tag1/Data1 through TagN/DataN. A split TSB has 2N lines, arranged as two such groups of N lines.]
15.3.1
Hardware Support for TSB Access
The MMU hardware provides services to allow the TLB miss handler to efficiently reload a missing TLB entry for an 8 kB or 64 kB page. These services include:
• Formation of TSB Pointers based on the missing virtual address.
• Formation of the TTE Tag Target used for the TSB tag comparison.
• Efficient atomic write of a TLB entry with a single store ASI operation.
• Alternate globals on MMU-signalled traps.
A typical TLB miss and refill sequence is as follows:
1. A TLB miss causes either an instruction_access_MMU_miss or a data_access_MMU_miss exception.
2. The appropriate TLB miss handler loads the TSB Pointers and the TTE Tag Target with loads from the MMU alternate space.
3. Using this information, the TLB miss handler checks to see if the desired TTE exists in the TSB. If so, the TTE Data is loaded into the TLB Data In register to initiate an atomic write of the TLB entry chosen by the replacement algorithm.
4. If the TTE does not exist in the TSB, the TLB miss handler jumps to a more sophisticated (and slower) TSB miss handler.

The virtual address used in the formation of the pointer addresses comes from the Tag Access register, which holds the virtual address and context of the load or store responsible for the MMU exception. See Section 15.9, “MMU Internal Registers and ASI Operations” on page 220. (Note that there are no separate physical registers in UltraSPARC-IIi hardware for the Pointer registers, but rather they are implemented through a dynamic re-ordering of the data stored in the Tag Access and the TSB registers.)

Pointers are provided by hardware for the most common cases of 8 kB and 64 kB page miss processing. These pointers give the virtual addresses where the 8 kB and 64 kB TTEs would be stored if either is present in the TSB.

N is defined to be the TSB_Size field of the TSB register; it ranges from 0 to 7. Note that TSB_Size refers to the size of each TSB when the TSB is split.

For a shared TSB (TSB register split field=0):
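8K_POINTER  = TSB_Base<63:13+N> || VA<21+N:13> || 0000
64K_POINTER = TSB_Base<63:13+N> || VA<24+N:16> || 0000

For a split TSB (TSB register split field=1):

8K_POINTER  = TSB_Base<63:14+N> || 0 || VA<21+N:13> || 0000
64K_POINTER = TSB_Base<63:14+N> || 1 || VA<24+N:16> || 0000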
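The pointer formation can be sketched in Python as follows. This is a minimal model under stated assumptions, not the hardware logic of Section 15.11.3: it assumes 16-byte TTEs (an 8-byte tag plus an 8-byte data word), a TSB of 512 × 2^N entries (as implied by VA bits 21 through 13 indexing the smallest TSB), and, for a split TSB, that the 64 kB half follows the 8 kB half. The function name tsb_pointers is hypothetical.

```python
def tsb_pointers(tsb_base, va, tsb_size, split):
    """Return (8K_POINTER, 64K_POINTER) for a missing virtual address.

    Illustrative model only: assumes 16-byte TTEs and 512 * 2**tsb_size
    entries per TSB, with tsb_size (N) in the range 0..7.
    """
    index_bits = 9 + tsb_size
    index_mask = (1 << index_bits) - 1
    index_8k = (va >> 13) & index_mask    # low bits of the 8 kB page number
    index_64k = (va >> 16) & index_mask   # low bits of the 64 kB page number
    half_bytes = 1 << (13 + tsb_size)     # bytes in one TSB (or one half if split)
    if split:
        # Base is aligned to both halves; the 64 kB half sits above the 8 kB half.
        base = tsb_base & ~(2 * half_bytes - 1)
        return base | (index_8k << 4), base | half_bytes | (index_64k << 4)
    base = tsb_base & ~(half_bytes - 1)
    return base | (index_8k << 4), base | (index_64k << 4)


shared = tsb_pointers(0x10000000, 0x2000, 0, split=False)
halves = tsb_pointers(0x10000000, 0x2000, 0, split=True)
```

Each index is shifted left by four because one TSB line occupies 16 bytes.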
For a more detailed description of the pointer logic with pseudo-code and hardware implementation, see Section 15.11.3, “TSB Pointer Logic Hardware Description” on page 235.

The TSB Tag Target (described in Section 15.9, “MMU Internal Registers and ASI Operations” on page 220) is formed by aligning the missing access VA (from the Tag Access register) and the current context to positions found in the description of the TTE tag. This allows an XOR instruction for TSB hit detection.

These items must be locked in the TLB to avoid an error condition: TLB-miss handler, TSB and linked data, asynchronous trap handlers and data.

These items must be locked in the TSB (not necessarily the TLB) to avoid an error condition: TSB-miss handler and data, interrupt-vector handler and data.
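The XOR-based hit detection used by a TLB miss handler can be sketched as follows. The model is hypothetical: the TSB is represented as a Python list of (tag, data) pairs for a common TSB of the smallest size holding 8 kB pages, the tag layout (Context in bits 60:48, upper VA bits in bits 41:0) is read from the TTE tag description, and global (G=1) entries are not modelled. The names tag_target and tsb_lookup are illustrative only.

```python
def tag_target(va, context):
    """Model of the Tag Target: Context in bits 60:48, VA bits 63..22 in bits 41:0."""
    return ((context & 0x1FFF) << 48) | ((va & 0xFFFFFFFFFFFFFFFF) >> 22)


def tsb_lookup(tsb, va, context):
    """Return the TTE data word on a TSB hit, or None on a TSB miss.

    `tsb` is a list of 512 (tag, data) pairs indexed by VA bits 21..13.
    """
    tag, data = tsb[(va >> 13) & 0x1FF]
    hit = (tag ^ tag_target(va, context)) == 0  # one XOR compares context and VA
    valid = (data >> 63) & 1                    # explicit Valid bit in the TTE data
    return data if hit and valid else None


tsb = [(0, 0)] * 512
va, ctx = 0x404000, 5
entry_data = (1 << 63) | (0x2A << 13)
tsb[(va >> 13) & 0x1FF] = (tag_target(va, ctx), entry_data)
```

On a None result the handler would branch to the slower TSB miss handler; on a hit it would store the data word to the TLB Data In register.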
15.3.2
Alternate Global Selection During TLB Misses
In the SPARC-V9 normal trap mode, the software is presented with an alternate set of global registers in the integer register file. UltraSPARC-IIi provides an additional feature to facilitate fast handling of TLB misses. For the following traps, the trap handler is presented with a special set of MMU globals: fast_{instruction,data}_access_MMU_miss, {instruction,data}_access_exception, and fast_data_access_protection. The privileged_action and *mem_address_not_aligned traps use the normal alternate global registers.
Compatibility Note – The UltraSPARC-IIi MMU performs no hardware table
walking. The MMU hardware never directly reads or writes to the TSB.
15.4
MMU-Related Faults and Traps
TABLE 15-3 lists the traps recorded by the MMU.
TABLE 15-3 MMU Traps

Trap Name                           Trap Cause
fast_instruction_access_MMU_miss    iTLB miss
instruction_access_exception        Several (see below)
fast_data_access_MMU_miss           dTLB miss
data_access_exception               Several (see below)
fast_data_access_protection         Protection violation
privileged_action                   Use of privileged ASI
*_watchpoint                        Watchpoint hit
*_mem_address_not_aligned           Misaligned mem op

Registers Updated (Stored State in MMU): I-SFSR; I-Tag Access; D-SFSR, SFAR; D-Tag Access

1. Contents undefined if instruction_access_exception is due to virtual address out of range.
Note – The fast_instruction_access_MMU_miss, fast_data_access_MMU_miss, and
fast_data_access_protection traps are generated instead of instruction_access_MMU_miss, data_access_MMU_miss, and data_access_protection traps, respectively.
15.4.1
Instruction_access_MMU_miss Trap
This trap occurs when the I-MMU is unable to find a translation for an instruction access; that is, when the appropriate TTE is not in the iTLB.
15.4.2
Instruction_access_exception Trap
This trap occurs when the I-MMU is enabled and one of the following happens:
• The I-MMU detects a privilege violation for an instruction fetch; that is, an attempted access to a privileged page when PSTATE.PRIV=0.
• Virtual address out of range and PSTATE.AM is not set. See Section 14.1.7, “44-bit Virtual Address Space” on page 184. Note that the cases of JMPL/RETURN and branch-CALL-sequential are handled differently. The contents of the I-Tag Access Register are undefined in this case, but are not needed by software.
15.4.3
Data_access_MMU_miss Trap
This trap occurs when the MMU is unable to find a translation for a data access; that is, when the appropriate TTE is not in the data TLB for a memory operation.
15.4.4
Data_access_exception Trap
This trap occurs when the D-MMU is enabled and one of the following events occurs (the D-MMU does not prioritize these):

• The D-MMU detects a privilege violation for a data or FLUSH instruction access; that is, an attempted access to a privileged page when PSTATE.PRIV=0.
• A speculative (non-faulting) load or FLUSH instruction is issued to a page marked with the side-effect bit (E-bit)=1.
• An atomic instruction (including 128-bit atomic load) is issued to a memory address marked uncacheable in a physical cache; that is, with CP=0.
• An invalid LDA/STA ASI value, invalid virtual address, read to write-only register, or write to read-only register, but not for an attempted user access to a restricted ASI (see the privileged_action trap described below).
• An access (including FLUSH) with an ASI other than ASI_{PRIMARY,SECONDARY}_NO_FAULT{_LITTLE} to a page marked with the NFO (no-fault-only) bit.
• Virtual address out of range (including FLUSH) and PSTATE.AM is not set. See Section 4.2, “Virtual Address Translation” on page 23.

The data_access_exception trap also occurs when the D-MMU is disabled and one of the following occurs:

• A speculative (non-faulting) load or FLUSH instruction is issued when LSU_Control_Register.DP=0.
• An atomic instruction (including 128-bit atomic load) is issued using the ASI_PHYS_BYPASS_EC_WITH_EBIT{_LITTLE} ASIs. In this case SFSR.FT=04₁₆.
15.4.5
Data_access_protection Trap
This trap occurs when the MMU detects a protection violation for a data access. A protection violation is defined to be an attempted store to a page without write permission.
15.4.6
Privileged_action Trap
This trap occurs when an access is attempted using a restricted ASI while in nonprivileged mode (PSTATE.PRIV=0).
15.4.7
Watchpoint Trap
This trap occurs when watchpoints are enabled and the D-MMU detects a load or store to the virtual or physical address specified by the VA Data Watchpoint Register or the PA Data Watchpoint Register, respectively. See Section A.5, “Watchpoint Support” on page 382.
15.4.8
Mem_address_not_aligned Trap
This trap occurs when a load, store, atomic, or JMPL/RETURN instruction with a misaligned address is executed. The LSU signals this trap, but the D-MMU records the fault information in the SFSR and SFAR.
15.5
MMU Operation Summary
TABLE 15-6 on page 215 summarizes the behavior of the D-MMU; TABLE 15-7 on page 216 summarizes the behavior of the I-MMU for normal (non-UltraSPARC-IIi-internal) ASIs using tabulated abbreviations. In each case, and for all conditions, the behavior of the MMU is given by one of the abbreviations in TABLE 15-4. TABLE 15-5 lists abbreviations for ASI types.
TABLE 15-4 Abbreviations for MMU Behavior

Abbreviation    Meaning
ok              Normal Translation
dmiss           data_access_MMU_miss trap
dexc            data_access_exception trap
dprot           data_access_protection trap
imiss           instruction_access_MMU_miss trap
iexc            instruction_access_exception trap
TABLE 15-5 Abbreviations for ASI Types

Abbreviation    Meaning
NUC             ASI_NUCLEUS*
PRIM            Any ASI with PRIMARY translation, except *NO_FAULT
SEC             Any ASI with SECONDARY translation, except *NO_FAULT
PRIM_NF         ASI_PRIMARY_NO_FAULT*
SEC_NF          ASI_SECONDARY_NO_FAULT*
U_PRIM          ASI_AS_IF_USER_PRIMARY*
U_SEC           ASI_AS_IF_USER_SECONDARY*
BYPASS          ASI_PHYS_* and also other ASIs that require the MMU to perform a bypass operation (such as D-cache access)
Note – The “*_LITTLE” versions of the ASIs behave the same as the big-endian versions with regard to the MMU table of operations.

Other abbreviations include “W” for the writable bit, “E” for the side-effect bit, and “P” for the privileged bit.

The tables do not cover the following cases:

• Invalid ASIs, ASIs that have no meaning for the opcodes listed, or non-existent ASIs; for example, ASI_PRIMARY_NO_FAULT for a store or atomic; also, access to UltraSPARC-IIi internal registers other than LDXA, LDFA, STDFA or STXA, except for I-cache diagnostic accesses other than LDDA, STDFA or STXA; see Section 6.3.2, “UltraSPARC-IIi (Non-SPARC-V9) ASI Extensions” on page 41; the MMU signals a data_access_exception trap (FT=08₁₆) for this case.
• Attempted access using a restricted ASI in non-privileged mode; the MMU signals a privileged_action exception for this case.
• An atomic instruction (including 128-bit atomic load) issued to a memory address marked uncacheable in a physical cache (that is, with CP=0), including cases in which the D-MMU is disabled; the MMU signals a data_access_exception trap (FT=04₁₆) for this case.
• A data access (including FLUSH) with an ASI other than ASI_{PRIMARY,SECONDARY}_NO_FAULT{_LITTLE} to a page marked with the NFO (no-fault-only) bit; the MMU signals a data_access_exception trap (FT=10₁₆) for this case.
• Virtual address out of range (including FLUSH) and PSTATE.AM is not set; the MMU signals a data_access_exception trap (FT=20₁₆) for this case.
TABLE 15-6
D-MMU Operations for Normal ASIs
Condition Behavior ASI W TLB Miss E=0 P=0 E=0 P=1 E=1 P=0 E=1 P=1
Opcode
PRIV Mode
0 Load 1
PRIM, SEC PRIM_NF, SEC_NF PRIM, SEC, NUC PRIM_NF, SEC_NF U_PRIM, U_SEC
— — — — — — —
dmiss dmiss dmiss dmiss dmiss dmiss dmiss dmiss dmiss dmiss dmiss dmiss dmiss
ok ok ok ok ok ok ok dprot ok
dexc dexc
ok dexc ok
dexc dexc
dexc dexc dexc ok dexc dexc ok dexc dexc dprot ok dexc dexc dexc dexc dexc dprot ok dexc dexc dprot ok dexc dexc
FLUSH
0 1 0 PRIM, SEC
0 1 0 1 0 1 — —
Store or Atomic 1
PRIM, SEC, NUC
dprot ok dprot ok
U_PRIM, U_SEC — — 0 1 BYPASS BYPASS
privileged_action Bypass. No traps when D-MMU enabled, PRIV=1.
TABLE 15-7 I-MMU Operations for Normal ASIs

Condition     Behavior
PRIV Mode     TLB Miss    P=0    P=1
0             imiss       ok     iexc
1             imiss       ok     ok
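The I-MMU behavior for normal ASIs reduces to two checks, which the following Python sketch expresses using the abbreviations of TABLE 15-4. The function name immu_behavior is hypothetical, and the sketch covers only the conditions tabulated for the I-MMU (it does not model out-of-range virtual addresses).

```python
def immu_behavior(priv, tlb_miss, p_bit):
    """Return the TABLE 15-4 abbreviation for an instruction access.

    priv     -- PSTATE.PRIV (0 or 1)
    tlb_miss -- True if no matching TTE is in the iTLB
    p_bit    -- the P (Privileged) bit of the matching TTE
    """
    if tlb_miss:
        return "imiss"   # instruction_access_MMU_miss trap
    if p_bit and not priv:
        return "iexc"    # privilege violation: instruction_access_exception trap
    return "ok"          # normal translation
```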
See Section 6.3, “Alternate Address Spaces” on page 39 for a summary of the UltraSPARC-IIi ASI map.
15.6
ASI Value, Context, and Endianness Selection for Translation
The MMU uses a two-step process to select the context for a translation:

1. The ASI is determined (conceptually by the Integer Unit) from the instruction, trap level, and the processor endian mode.
2. The context register is determined directly from the ASI.

The ASI value and endianness (little or big) are determined for the I-MMU and D-MMU respectively according to TABLE 15-8 and TABLE 15-9 on page 217.
Note – The secondary context is never used to fetch instructions. The I-MMU uses
the value stored in the D-MMU Primary Context register when using the Primary Context identifier; there is no I-MMU Primary Context register.
Note – The endianness of a data access is specified by three conditions: the ASI
specified in the opcode or ASI register, the PSTATE current little endian bit, and the D-MMU invert endianness bit. The D-MMU invert endianness bit does not affect the ASI value recorded in the SFSR, but does invert the endianness that is otherwise specified for the access.
Note – The D-MMU Invert Endianness (IE) bit inverts the endianness for all accesses to translating ASIs, including LD/ST/Atomic alternates that have specified an ASI. That is, LDXA [%g1]ASI_PRIMARY_LITTLE will be big-endian if the IE bit is on. Accesses to non-translating ASIs are not affected by the D-MMU’s IE bit. See Section 6.3, “Alternate Address Spaces” on page 39 for information about non-translating ASIs.
TABLE 15-8 ASI Mapping for Instruction Accesses

Condition for Instruction Access    Resulting Action
PSTATE.TL                           Endianness    ASI Value (in SFSR)
0                                   Big           ASI_PRIMARY
>0                                  Big           ASI_NUCLEUS
TABLE 15-9 ASI Mapping for Data Accesses

Opcode: LD/ST/Atomic/FLUSH

PSTATE.TL    PSTATE.CLE    D-MMU.IE    Endianness    ASI Value (Recorded in SFSR)
0            0             0           Big           ASI_PRIMARY
0            0             1           Little        ASI_PRIMARY
0            1             0           Little        ASI_PRIMARY_LITTLE
0            1             1           Big           ASI_PRIMARY_LITTLE
>0           0             0           Big           ASI_NUCLEUS
>0           0             1           Little        ASI_NUCLEUS
>0           1             0           Little        ASI_NUCLEUS_LITTLE
>0           1             1           Big           ASI_NUCLEUS_LITTLE

Opcode: LD/ST/Atomic Alternate with specified ASI not ending in “_LITTLE”

PSTATE.TL    PSTATE.CLE    D-MMU.IE    Endianness    ASI Value (Recorded in SFSR)
Don’t Care   Don’t Care    0           Big¹          Specified ASI value from immediate field in opcode or ASI register
Don’t Care   Don’t Care    1           Little¹       Specified ASI value from immediate field in opcode or ASI register

Opcode: LD/ST/Atomic Alternate with specified ASI ending in “_LITTLE”

PSTATE.TL    PSTATE.CLE    D-MMU.IE    Endianness    ASI Value (Recorded in SFSR)
Don’t Care   Don’t Care    0           Little        Specified ASI value from immediate field in opcode or ASI register
Don’t Care   Don’t Care    1           Big           Specified ASI value from immediate field in opcode or ASI register

1. Accesses to non-translating ASIs are always made in “big endian” mode, regardless of the setting of D-MMU.IE. See Section 6.3, “Alternate Address Spaces” on page 39 for information about non-translating ASIs.
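The data-access rules of TABLE 15-9 can be summarized in a short Python sketch. It is an illustrative model only: ASIs are represented by their names rather than by numeric values, the caller states whether the ASI is a translating one, and the function names implicit_asi and access_endianness are hypothetical.

```python
def implicit_asi(tl, cle):
    """ASI recorded for LD/ST/Atomic/FLUSH opcodes that do not specify an ASI."""
    name = "ASI_PRIMARY" if tl == 0 else "ASI_NUCLEUS"
    return name + "_LITTLE" if cle else name


def access_endianness(asi_name, ie, translating=True):
    """Return "big" or "little" for a data access.

    asi_name    -- implicit or explicitly specified ASI name
    ie          -- D-MMU.IE bit of the matching TTE
    translating -- False for non-translating ASIs, which ignore D-MMU.IE
    """
    little = asi_name.endswith("_LITTLE")
    if ie and translating:
        little = not little  # IE inverts whatever the ASI selected
    return "little" if little else "big"
```

Note that the IE bit changes only the endianness of the access; the ASI name returned by implicit_asi, which stands for the value recorded in the SFSR, is unaffected.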
The context register used by the data and instruction MMUs is determined from the following table. A comprehensive list of ASI values can be found in the ASI map in Section 6.3, “Alternate Address Spaces” on page 39. The context register selection is not affected by the endianness of the access.
TABLE 15-10 I-MMU and D-MMU Context Register Usage

ASI Value               Context Register
ASI_*NUCLEUS*¹          Nucleus (0000₁₆ hard-wired)
ASI_*PRIMARY*²          Primary
ASI_*SECONDARY*³        Secondary
All other ASI values    (Not applicable, no translation)

1. Any ASI name containing the string “NUCLEUS”.
2. Any ASI name containing the string “PRIMARY”.
3. Any ASI name containing the string “SECONDARY”.
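Because the selection in TABLE 15-10 depends only on the ASI name, it can be modelled as a substring test. The sketch below is illustrative; context_register is a hypothetical name, and the ASI names are passed as strings purely for demonstration.

```python
def context_register(asi_name):
    """Return the context register selected by an ASI name, or None if no translation."""
    if "NUCLEUS" in asi_name:
        return "Nucleus"      # hard-wired to zero
    if "PRIMARY" in asi_name:
        return "Primary"
    if "SECONDARY" in asi_name:
        return "Secondary"
    return None               # all other ASI values: no translation
```

The "_LITTLE" suffix plays no part in the result, which matches the statement that context register selection is not affected by the endianness of the access.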
15.7
MMU Behavior During Reset, MMU Disable, and RED_state
During global reset of the UltraSPARC-IIi CPU, the following actions occur:

• No change occurs in any block of the D-MMU.
• No change occurs in the data path or TLB blocks of the I-MMU.
• The I-MMU resets its internal state machine to normal (non-suspended) operation.
• The I-MMU and D-MMU Enable bits in the LSU Control Register (see Section A.6, “LSU_Control_Register” on page 384) are set to zero.

On entering RED_state, the I-MMU and D-MMU Enable bits in the LSU_Control_Register are set to zero.

Either MMU is defined to be disabled when its respective MMU Enable bit equals 0; also, the I-MMU is disabled whenever the CPU is in RED_state. The D-MMU is enabled or disabled solely by the state of the D-MMU Enable bit.

When the D-MMU is disabled it truncates all accesses, behaving as if ASI_PHYS_BYPASS_EC_WITH_EBIT had been used, notably with side effect bit (E-bit)=1, P=0 and CP=0. Other attribute bit settings can be found in Section 15.10, “MMU Bypass Mode” on page 234. However, if a bypass ASI is used while the D-MMU is disabled, the bypass operation behaves as it does when the D-MMU is enabled; that is, the access is processed with the E and CP bits as specified by the bypass ASI.

When the I-MMU is disabled, it truncates all instruction accesses and passes the physically-cacheable bit (CP=0) to the cache system. The access will not generate an instruction_access_exception trap.

When disabled, both the I-MMU and D-MMU correctly perform all LDXA and STXA operations to internal registers, and traps are signalled just as if the MMU were enabled. For instance, if a *NO_FAULT load is issued when the D-MMU is disabled, the D-MMU signals a data_access_exception trap (FT=02₁₆), since accesses when the D-MMU is disabled have E=1.
Note – While the D-MMU is disabled, data in the D-cache can be accessed only
using load and store alternates to the UltraSPARC-IIi internal D-cache access ASI. Normal loads and stores bypass the D-cache. Data in the D-cache cannot be accessed using load or store alternates that use ASI_PHYS_*.
Note – No reset of the MMU is performed by a chip reset or by entering RED_state.
Before the MMUs are enabled, the operating system software must explicitly write each entry with either a valid TLB entry or an entry with the valid bit set to zero. The operation of the I-MMU or D-MMU in enabled mode is undefined if the TLB valid bits have not been set explicitly beforehand.
15.8
Compliance with the SPARC-V9 Annex F
The UltraSPARC-IIi MMU complies completely with the SPARC-V9 MMU Requirements described in Annex F of The SPARC Architecture Manual, Version 9. TABLE 15-11 shows how various protection modes can be achieved, if necessary, through the presence or absence of a translation in the I- or D-MMU. Note that this behavior requires specialized TLB miss handler code to guarantee these conditions.
TABLE 15-11 MMU Compliance with SPARC-V9 Annex F Protection Mode

Condition
TTE in D-MMU    TTE in I-MMU    Writable Attribute Bit    Resultant Protection Mode
Yes             No              0                         Read-only
No              Yes             Don’t Care                Execute-only
Yes             No              1                         Read/Write
Yes             Yes             0                         Read-only/Execute
Yes             Yes             1                         Read/Write/Execute
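The mapping in TABLE 15-11 can be restated as a small Python function. This is only a restatement of the table for illustration; protection_mode is a hypothetical name, and the case in which neither MMU holds a translation is not listed in the table, so the sketch returns None for it.

```python
def protection_mode(tte_in_dmmu, tte_in_immu, writable):
    """Return the resultant protection mode string used in TABLE 15-11."""
    parts = []
    if tte_in_dmmu:
        # Data access is allowed; the W bit decides whether stores are allowed.
        parts.append("Read/Write" if writable else "Read-only")
    if tte_in_immu:
        # Instruction fetch is allowed.
        parts.append("Execute" if tte_in_dmmu else "Execute-only")
    return "/".join(parts) if parts else None
```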
15.9
MMU Internal Registers and ASI Operations
15.9.1
Accessing MMU Registers
All internal MMU registers can be accessed directly by the CPU through UltraSPARC-IIi-defined ASIs. Several of the registers have been assigned their own ASI because these registers are crucial to the speed of the TLB miss handler. Allowing the use of %g0 for the address reduces the number of instructions to perform the access to the alternate space (by eliminating address formation).

See Section 15.10, “MMU Bypass Mode” on page 234 for details on the behavior of the MMU during all other UltraSPARC-IIi ASI accesses. For instance, to facilitate an access to the D-cache, the MMU performs a bypass operation.
Caution – STXA to an MMU register requires either a MEMBAR #Sync, FLUSH, DONE, or RETRY before the point that the effect must be visible to load / store / atomic accesses. Either a FLUSH, DONE, or RETRY is needed before the point that the effect must be visible to instruction accesses: MEMBAR #Sync is not sufficient. In either case, one of these instructions must be executed before the next non-internal store or load of any type and on or before the delay slot of a DCTI of any type. This is necessary to avoid corrupting data.

If the low order three bits of the VA are non-zero in a LDXA/STXA to/from these registers, a mem_address_not_aligned trap occurs. Writes to read-only, reads to write-only, illegal ASI values, or illegal VA for a given ASI may cause a data_access_exception trap (FT=08₁₆). (The hardware detects VA violations in only an unspecified lower portion of the virtual address.)
Caution – UltraSPARC-IIi does not check for out-of-range virtual addresses during an STXA to any internal register; it simply sign extends the virtual address based on VA<43>. Software must guarantee that the VA is within range.

Writes to the TSB register, Tag Access register, and PA and VA Watchpoint Address Registers are not checked for out-of-range VA. No matter what is written to the register, VA<63:43> will always be identical on a read.
TABLE 15-12 UltraSPARC-IIi MMU Internal Registers and ASI Operations

I-MMU ASI    D-MMU ASI    VA              Access        Register or Operation Name
50₁₆         58₁₆         0₁₆             Read-only     I-/D-TSB Tag Target Registers
—            58₁₆         8₁₆             Read/Write    Primary Context Register
—            58₁₆         10₁₆            Read/Write    Secondary Context Register
50₁₆         58₁₆         18₁₆            Read/Write    I-/D-Synchronous Fault Status Registers
—            58₁₆         20₁₆            Read-only     D Synchronous Fault Address Register
50₁₆         58₁₆         28₁₆            Read/Write    I-/D-TSB Registers
50₁₆         58₁₆         30₁₆            Read/Write    I-/D-TLB Tag Access Registers
—            58₁₆         38₁₆            Read/Write    Virtual Watchpoint Address
—            58₁₆         40₁₆            Read/Write    Physical Watchpoint Address
51₁₆         59₁₆         0₁₆             Read-only     I-/D-TSB 8K Pointer Registers
52₁₆         5A₁₆         0₁₆             Read-only     I-/D-TSB 64K Pointer Registers
—            5B₁₆         0₁₆             Read-only     D-TSB Direct Pointer Register
54₁₆         5C₁₆         0₁₆             Write-only    I-/D-TLB Data In Registers
55₁₆         5D₁₆         0₁₆..1F8₁₆      Read/Write    I-/D-TLB Data Access Registers
56₁₆         5E₁₆         0₁₆..1F8₁₆      Read-only     I-/D-TLB Tag Read Register
57₁₆         5F₁₆         See 15.9.10     Write-only    I-/D-MMU Demap Operation
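For illustration, a few rows of TABLE 15-12 can be encoded as a lookup structure together with the access rules stated in this section. The sketch is hypothetical throughout: the register key names, the function names, and the choice to test alignment before the access type are assumptions made for the example, and the text says only that the violations listed "may cause" a data_access_exception.

```python
# (I-MMU ASI, D-MMU ASI, VA, access) for a subset of TABLE 15-12.
# None marks a register that has no I-MMU ASI; key names are illustrative.
MMU_REGISTERS = {
    "TSB_TAG_TARGET":    (0x50, 0x58, 0x00, "r"),
    "PRIMARY_CONTEXT":   (None, 0x58, 0x08, "rw"),
    "SECONDARY_CONTEXT": (None, 0x58, 0x10, "rw"),
    "SFSR":              (0x50, 0x58, 0x18, "rw"),
    "SFAR":              (None, 0x58, 0x20, "r"),
    "TSB":               (0x50, 0x58, 0x28, "rw"),
    "TAG_ACCESS":        (0x50, 0x58, 0x30, "rw"),
    "TLB_DATA_IN":       (0x54, 0x5C, 0x00, "w"),
}


def find_register(asi, va):
    """Return (name, access) for an ASI/VA pair, or None if the pair is not listed."""
    for name, (i_asi, d_asi, reg_va, access) in MMU_REGISTERS.items():
        if asi in (i_asi, d_asi) and va == reg_va:
            return name, access
    return None


def classify_access(asi, va, is_write):
    """Classify an LDXA/STXA to an MMU register (alignment checked first by assumption)."""
    if va & 0x7:
        return "mem_address_not_aligned"
    reg = find_register(asi, va)
    if reg is None or ("w" if is_write else "r") not in reg[1]:
        return "data_access_exception"
    return "ok"
```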
15.9.2
I-/D-TSB Tag Target Registers
The I- and D-TSB Tag Target registers are simply respective bit-shifted versions of the data stored in the I- and D-Tag Access registers. Since the I- or D-Tag Access registers are updated on I- or D-TLB misses, respectively, the I- and D-Tag Target registers appear to software to be updated on an I or D TLB miss.
FIGURE 15-3 MMU Tag Target Registers (Two Registers)

Bits     Field
63:61    000
60:48    Context
47:42    —
41:0     VA

I/D Context: The context associated with the missing virtual address.

I/D VA: The most significant bits of the missing virtual address.
15.9.3
Context Registers
The context registers are shared by the I- and D-MMUs. The Primary Context Register is defined as shown in FIGURE 15-4.

FIGURE 15-4 D-MMU Primary Context Register

Bits     Field
63:13    —
12:0     PContext

PContext: Context identifier for the primary address space.

The Secondary Context register is defined in FIGURE 15-5.

FIGURE 15-5 D-MMU Secondary Context Register

Bits     Field
63:13    —
12:0     SContext

SContext: Context identifier for the secondary address space.

The Nucleus Context register is hardwired to zero:

FIGURE 15-6 D-MMU Nucleus Context Register

Bits     Field
63:0     0 (all 64 bits hardwired to zero)
Compatibility Note – The single context register of the SPARC-V8 Reference MMU
has been replaced in UltraSPARC-IIi by the three context registers shown in Figures 15-4, 15-5, and 15-6.
Note – A STXA to the context registers requires either a MEMBAR #Sync, FLUSH, DONE, or RETRY before the point that the effect must be visible to data accesses. Either a FLUSH, DONE, or RETRY is needed before the point that the effect must be visible to instruction accesses: MEMBAR #Sync is not sufficient. In either case, one of these instructions must be executed before the next translating or bypass store or load of any type. This is necessary to avoid corrupting data.
15.9.4
I-/D-MMU Synchronous Fault Status Registers (SFSR)
The I- and D-MMU each maintain their own SFSR register, which is defined as follows:
FIGURE 15-7 I- and D-MMU Synchronous Fault Status Register Format

Bits     Field
63:24    —
23:16    ASI
15:14    —
13:7     FT
6        E
5:4      CT
3        PR
2        W
1        OW
0        FV
ASI: The ASI field records the 8-bit ASI associated with the faulting instruction. This field is valid for both D-MMU and I-MMU SFSRs and for all traps in which the FV bit is set. JMPL and RETURN mem_address_not_aligned traps set the default ASI, as does a trapping non-alternate load or store; that is, to ASI_PRIMARY for PSTATE.CLE=0, or to ASI_PRIMARY_LITTLE otherwise.

FT: The Fault Type field indicates the exact condition that caused the recorded fault, according to TABLE 15-13. In the D-MMU the Fault Type field is valid only for data_access_exception traps; there is no ambiguity in all other MMU trap cases. Note that the hardware does not priority-encode the bits set in the fault type register; that is, multiple bits may be set. The FT field in the D-MMU SFSR reads zero for traps other than data_access_exception. The FT field in the I-MMU SFSR always reads zero for instruction_access_MMU_miss, and either 01₁₆, 20₁₆, or 40₁₆ for instruction_access_exception, as all other fault types do not apply.
TABLE 15-13 MMU Synchronous Fault Status Register FT (Fault Type) Field

FT      Fault Type
01₁₆    Privilege violation
02₁₆    Speculative Load or Flush instruction to page marked with E-bit. This bit is zero for internal ASI accesses.
04₁₆    Atomic (including 128-bit atomic load) to page marked uncacheable. This bit is zero for internal ASI accesses, except for atomics to DTLB_DATA_ACCESS_REG (5D₁₆), or DTLB_DATA_IN_REG (5C₁₆), or DTLB_TAG_READ_REG (5E₁₆) which update according to the TLB entry accessed.
08₁₆    Illegal LDA/STA ASI value, VA, RW, or size. Excludes cases where 02₁₆ and 04₁₆ are set.
10₁₆    Access other than non-faulting load to page marked NFO. This bit is zero for internal ASI accesses.
20₁₆    VA out of range (D-MMU and I-MMU branch, CALL, sequential)
40₁₆    VA out of range (I-MMU JMPL or RETURN)
E: reports the side-effect bit (E) associated with the faulting data access or FLUSH instruction; set by FLUSH or translating ASI accesses (see Section 6.3, “Alternate Address Spaces” on page 39) mapped by the TLB with the E bit set and ASI_PHYS_BYPASS_EC_WITH_EBIT{_LITTLE} ASIs (15₁₆ and 1D₁₆). Other cases that update the SFSR (including bypass or internal ASI accesses) set the E bit to 0. It always reads as 0 in the I-MMU.
CT: Context register selection, as described in the following table; the context is set to 11₂ when the access does not have a translating ASI (see Section 6.3, “Alternate Address Spaces” on page 39).
TABLE 15-14 MMU SFSR Context ID Field Description
Context ID   I-MMU Context   D-MMU Context
00           Primary         Primary
01           Reserved        Secondary
10           Nucleus         Nucleus
11           Reserved        Reserved
PR: Privilege; set if the faulting access occurred while in Privileged mode; this field is valid for all traps in which the Fault Valid (FV) bit is set.
W: Write; set if the faulting access indicated a data write operation (a store or atomic load/store instruction); always reads as 0 in the I-MMU SFSR.
OW: Overwrite; set to one when the MMU detects a fault, if the Fault Valid bit has not been cleared from a previous fault; otherwise, it is set to zero.
FV: Fault Valid; set when the MMU detects a fault; cleared only on an explicit ASI write of 0 to the SFSR register; when FV is not set, the values of the remaining fields in the SFSR and SFAR are undefined.
The SFSR and the Tag Access registers both maintain state concerning a previous translation causing an exception. The update policy for the SFSR and the Tag Access registers is shown in TABLE 15-6 on page 215.
Note – A fast_{instruction,data}_access_MMU_miss trap does not cause the SFSR or SFAR to be written. In this case the D-SFAR information can be obtained from the D Tag Access register.
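As an illustration, the SFSR field layout in FIGURE 15-7 can be unpacked with shifts and masks. The following C sketch assumes the 64-bit SFSR value has already been read by privileged code; the type and function names (sfsr_fields, sfsr_decode) are hypothetical and are not part of any Sun-supplied interface.

```c
#include <stdint.h>

/* Hypothetical container for the SFSR fields of FIGURE 15-7. */
typedef struct {
    unsigned asi;  /* bits 23:16, ASI of the faulting instruction */
    unsigned ft;   /* bits 13:7, fault type (TABLE 15-13)        */
    unsigned e;    /* bit 6, side-effect bit                     */
    unsigned ct;   /* bits 5:4, context register selection       */
    unsigned pr;   /* bit 3, privileged mode                     */
    unsigned w;    /* bit 2, write access                        */
    unsigned ow;   /* bit 1, overwrite                           */
    unsigned fv;   /* bit 0, fault valid                         */
} sfsr_fields;

/* Split a raw SFSR value into its fields. */
static sfsr_fields sfsr_decode(uint64_t sfsr)
{
    sfsr_fields f;
    f.asi = (unsigned)((sfsr >> 16) & 0xff);
    f.ft  = (unsigned)((sfsr >> 7) & 0x7f);
    f.e   = (unsigned)((sfsr >> 6) & 0x1);
    f.ct  = (unsigned)((sfsr >> 4) & 0x3);
    f.pr  = (unsigned)((sfsr >> 3) & 0x1);
    f.w   = (unsigned)((sfsr >> 2) & 0x1);
    f.ow  = (unsigned)((sfsr >> 1) & 0x1);
    f.fv  = (unsigned)(sfsr & 0x1);
    return f;
}
```

A handler would normally test fv first, since the remaining fields are undefined when FV is not set.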
15.9.5
I-/D-MMU Synchronous Fault Address Registers (SFAR)
15.9.5.1
I-MMU Fault Address
There is no I-MMU Synchronous Fault Address register. Instead, software must read the TPC register appropriately as discussed here.
For instruction_access_MMU_miss traps, TPC contains the virtual address that was not found in the I-MMU TLB.
For instruction_access_exception traps, “privilege violation” fault type, TPC contains the virtual address of the instruction in the privileged page that caused the exception.
For instruction_access_exception traps, “VA out of range” fault types, note that the TPC in these cases contains only a 44-bit virtual address, which is sign-extended based on bit VA[43] for read. Therefore, use the following methods to compute the virtual address that was out of range:
• For the branch, CALL, and sequential exception case, the TPC contains the lower 44 bits of the virtual address that is out of range. Because the hardware sign-extends a read of the TPC register based on VA[43], the contents of the TPC register XORed with FFFF F000 0000 0000₁₆ will give the full 64-bit out-of-range virtual address.
• For the JMPL or RETURN exception case, the TPC contains the virtual address of the JMPL or RETURN instruction itself. Software must disassemble the instruction to compute the out-of-range virtual address of the target.
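The XOR rule for the branch, CALL, and sequential case can be checked with a small C sketch. It models the sign-extended read of a 44-bit TPC and then applies the mask given above; both helper names are hypothetical, and the model assumes bit 43 is the sign bit, as implied by the 44-bit virtual address.

```c
#include <stdint.h>

/* Model of a TPC read: keep the low 44 bits and sign-extend from bit 43. */
static uint64_t tpc_read_model(uint64_t va)
{
    uint64_t low44 = va & UINT64_C(0x00000FFFFFFFFFFF);
    if (low44 & (UINT64_C(1) << 43))
        low44 |= UINT64_C(0xFFFFF00000000000);
    return low44;
}

/* Recover the out-of-range VA from the value read from TPC
 * (branch, CALL, and sequential case only). */
static uint64_t out_of_range_va(uint64_t tpc)
{
    return tpc ^ UINT64_C(0xFFFFF00000000000);
}
```

For example, an instruction fetch that runs sequentially past the top of the lower valid range reaches VA 0000 0800 0000 0000₁₆; the TPC then reads as FFFF F800 0000 0000₁₆, and the XOR restores the original address.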
15.9.5.2
D-MMU Fault Address
The Synchronous Fault Address register contains the virtual memory address of the fault recorded in the D-MMU Synchronous Fault Status register. There is no I-SFAR, since the instruction fault address is found in the trap program counter (TPC). The SFAR can be considered an additional field of the D-SFSR.
FIGURE 15-8 illustrates the D-SFAR.
FIGURE 15-8 D-MMU Synchronous Fault Address Register (SFAR) Format
Bits   Field
63:0   Fault Address (VA)
Fault Address: is the virtual address associated with the translation fault recorded in the D-SFSR. This field is valid only when the D-SFSR Fault Valid (FV) bit is set. This field is sign-extended based on VA[43], so bits VA[63:44] do not correspond to the virtual address used in the translation for the case of a VA-out-of-range data_access_exception trap (for this case, software must disassemble the trapping instruction).
15.9.6
I-/D- Translation Storage Buffer (TSB) Registers
The TSB registers provide information for the hardware formation of TSB pointers and tag target, to assist software in handling TLB misses quickly. If the TSB concept is not employed in the software memory management strategy, and therefore the pointer and tag access registers are not used, then the TSB registers need not contain valid data.
FIGURE 15-9 illustrates the TSB register.
FIGURE 15-9 I-/D-TSB Register Format
Bits   Field
63:13  TSB_Base (virtual)
12     Split
11:3   (reserved)
2:0    TSB_Size
I/D TSB_Base: provides the base virtual address of the Translation Storage Buffer. Software must ensure that the TSB Base is aligned on a boundary equal to the size of the TSB, or both TSBs in the case of a split TSB.
Caution – Stores to the TSB registers are not checked for out-of-range violations. Reads from these registers are sign-extended based on TSB_Base[43].
Split: When Split=1, the TSB 64 kB Pointer address is calculated assuming separate (but abutting and equally-sized) TSB regions for the 8 kB and the 64 kB TTEs. In this case, TSB_Size refers to the size of each TSB, and therefore the TSB 8 kB Pointer address calculation is not affected by the value of the Split bit. When Split=0, the TSB 64 kB Pointer address is calculated assuming that the same lines in the TSB are shared by 8 kB and 64 kB TTEs, called a “common TSB” configuration.
Caution – In the “common TSB” configuration (TSB.Split=0), 8 kB and 64 kB page TTEs can conflict unless the TLB miss handler explicitly checks the TTE for page size. Therefore, do not use the common TSB mode in an optimized handler. For example, suppose an 8 kB page at VA=2000₁₆ and a 64 kB page at VA=10000₁₆ both exist, which is a legal situation. Both map to the second TSB line (line 1) and have the same VA tag of 0. Therefore, the miss handler cannot distinguish these TTEs based on the TTE tag alone, and unless it reads the TTE data, it may load an incorrect TTE.
I/D TSB_Size: The Size field provides the size of the TSB according to the following:
• Number of entries in the TSB (or each TSB if split) = 512 × 2^TSB_Size.
• Number of entries in the TSB ranges from 512 entries at TSB_Size=0 (8 kB common TSB, 16 kB split TSB), to 64 K entries at TSB_Size=7 (1 MB common TSB, 2 MB split TSB).
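The size arithmetic and the common-TSB conflict described above can be reproduced with a few lines of C. This is a sketch under stated assumptions: each TTE occupies 16 bytes (which follows from 512 entries filling 8 kB), and the TTE tag holds VA[63:22], which is taken from the TTE definition elsewhere in this chapter rather than from this section. All function names are hypothetical.

```c
#include <stdint.h>

/* Entries per TSB (or per half of a split TSB): 512 * 2^TSB_Size. */
static uint64_t tsb_entries(unsigned tsb_size)
{
    return UINT64_C(512) << tsb_size;
}

/* Bytes per TSB, assuming 16-byte TTEs. */
static uint64_t tsb_bytes(unsigned tsb_size)
{
    return tsb_entries(tsb_size) * 16;
}

/* TSB line selected by a VA; page_shift is 13 for 8 kB pages, 16 for 64 kB. */
static uint64_t tsb_line(uint64_t va, unsigned page_shift, unsigned tsb_size)
{
    return (va >> page_shift) & (tsb_entries(tsb_size) - 1);
}

/* VA portion of the TTE tag, assumed to be VA[63:22]. */
static uint64_t tsb_tag_va(uint64_t va)
{
    return va >> 22;
}
```

With TSB_Size=0, the 8 kB page at 2000₁₆ and the 64 kB page at 10000₁₆ both select line 1 and both produce a tag of 0, which is the conflict the Caution describes.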
Note – Any update to the TSB register immediately affects the data that is returned
from later reads of the Tag Target and TSB Pointer registers.
15.9.7
I-/D-TLB Tag Access Registers
In each MMU the Tag Access register is used as a temporary buffer for writing the TLB Entry tag information. The Tag Access register may be updated during either of the following operations:
1. When the MMU signals a trap due to a miss, exception, or protection. The MMU hardware automatically writes the missing VA and the appropriate Context into the Tag Access register to facilitate formation of the TSB Tag Target register. See TABLE 15-6 on page 215 for the SFSR and Tag Access register update policy.
2. An ASI write to the Tag Access register. Before an ASI store to the TLB Data Access registers, the operating system must set the Tag Access register to the values desired in the TLB Entry. Note that an ASI store to the TLB Data In register for automatic replacement also uses the Tag Access register, but typically the value written into the Tag Access register by the MMU hardware is appropriate.
Note – Any update to the Tag Access registers immediately affects the data that is returned from subsequent reads of the Tag Target and TSB Pointer registers.
The TLB Tag Access Registers are defined in FIGURE 15-10:
FIGURE 15-10 I/D MMU TLB Tag Access Registers
Bits   Field
63:13  VA
12:0   Context
I/D VA: The 51-bit virtual page number. Note that writes to this field are not checked for out-of-range violation, but are sign-extended based on VA[43].
Caution – Stores to the Tag Access registers are not checked for out-of-range violations. Reads from these registers are sign-extended based on VA[43].
I/D Context: is the 13-bit context identifier. This field reads zero when there is no associated context with the access.
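When software writes the Tag Access register itself (case 2 above), the value is simply the page-aligned VA combined with the 13-bit context. A minimal C sketch of the packing and unpacking follows; the function names are hypothetical, and the sketch does not model the hardware sign extension of reads.

```c
#include <stdint.h>

/* Build a Tag Access value: VA in bits 63:13, Context in bits 12:0. */
static uint64_t tag_access_pack(uint64_t va, unsigned context)
{
    return (va & ~UINT64_C(0x1fff)) | (uint64_t)(context & 0x1fff);
}

/* Extract the page-aligned VA from a Tag Access value. */
static uint64_t tag_access_va(uint64_t reg)
{
    return reg & ~UINT64_C(0x1fff);
}

/* Extract the 13-bit context from a Tag Access value. */
static unsigned tag_access_context(uint64_t reg)
{
    return (unsigned)(reg & 0x1fff);
}
```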
15.9.8
I-/D-TSB 8 kB/64 kB Pointer and Direct Pointer Registers
These registers are provided to help the software determine the location of the missing or trapping TTE in the software-maintained TSB. The TSB 8 kB and 64 kB Pointer registers provide the possible locations of the 8 kB and 64 kB TTE, respectively. The Direct Pointer register is mapped by hardware to either the 8 kB or 64 kB Pointer register in the case of a fast_data_access_protection exception according to the known size of the trapping TTE. In the case of a 512 kB or 4 MB page miss, the Direct Pointer register returns the pointer as if the miss were from an 8 kB page.
The TSB Pointer registers are implemented as a re-order of the current data stored in the Tag Access register and the TSB register. If the Tag Access register or TSB register is updated through a direct software write (via a STXA instruction), then the Pointer register values will be updated as well.
The bit that controls selection of 8 kB or 64 kB address formation for the Direct Pointer register is a state bit in the D-MMU that is updated during a fast_data_access_protection exception. It records whether the page that hit in the TLB was a 64 kB page or a non-64 kB page, in which case 8 kB is assumed.
The I-/D-TSB 8 kB/64 kB Pointer registers are defined as follows:
FIGURE 15-11 I-/D-MMU TSB 8 kB/64 kB Pointer and D-MMU Direct Pointer Register
Bits   Field
63:0   VA
VA: is the full virtual address of the TTE in the TSB, as determined by the MMU hardware, as described in Section 15.3.1, “Hardware Support for TSB Access” on page 209. Note that this field is sign-extended based on VA[43].
15.9.9
I-/D-TLB Data-In/Data-Access/Tag-Read Registers
Access to the TLB is complicated due to the need to provide an atomic write of a TLB entry data item (tag and data) that is larger than 64 bits, the need to replace entries automatically through the TLB entry replacement algorithm as well as provide direct diagnostic access, and the need for hardware assist in the TLB miss handler. TABLE 15-15 shows the effect of loads and stores on the Tag Access register and the TLB.
TABLE 15-15 Effect of Loads and Stores on MMU Registers
Operation | Register | Effect on TLB tag | Effect on TLB data | Effect on Tag Access Register
Load | Tag Read | No effect. Contents returned | No effect | No effect
Load | Tag Access | No effect | No effect | No effect. Contents returned
Load | Data In | Trap with data_access_exception
Load | Data Access | No effect | No effect. Contents returned | No effect
Store | Tag Read | Trap with data_access_exception
Store | Tag Access | No effect | No effect | Written with store data
Store | Data In | TLB entry determined by replacement policy written with contents of Tag Access Register | TLB entry determined by replacement policy written with store data | No effect
Store | Data Access | TLB entry specified by STXA address written with contents of Tag Access Register | TLB entry specified by STXA address written with store data | No effect
TLB miss | (none) | No effect | No effect | Written with VA and context of access
The Data In and Data Access registers are the means of reading and writing the TLB for all operations. The TLB Data In register is used for TLB-miss and TSB-miss handler automatic replacement writes; the TLB Data Access register is used for operating system and diagnostic directed writes (writes to a specific TLB entry). Both types of registers have the same format, as follows:
FIGURE 15-12 MMU I-/D-TLB Data In/Access Registers
Bits   Field
63     V
62:61  Size
60     NFO
59     IE
58:50  Soft2
49:41  Diag
40:13  PA
12:7   Soft
6      L
5      CP
4      CV
3      E
2      P
1      W
0      G
Refer to the description of the TTE data in Section 15.2, “Translation Table Entry (TTE)” on page 205, for a complete description of the above data fields. Operations to the TLB Data In register require the virtual address to be set to zero. The format of the TLB Data Access register virtual address is as follows:
FIGURE 15-13 MMU TLB Data Access Address, in Alternate Space
Bits   Field
63:9   (reserved)
8:3    TLB Entry
2:0    000
TLB Entry: The TLB Entry number to be accessed, in the range 0 .. 63.
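Because the entry number occupies bits 8:3 of the alternate-space address, the address for a given entry is the entry number multiplied by 8. A short C sketch, using a hypothetical helper name, is shown below.

```c
#include <stdint.h>

/* Alternate-space VA that selects a TLB entry (0..63) for a
 * TLB Data Access load or store: entry number in bits 8:3. */
static uint64_t tlb_data_access_va(unsigned entry)
{
    return (uint64_t)(entry & 0x3f) << 3;
}
```

Entry 0 maps to address 0 and entry 63 maps to 1F8₁₆.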
The format for the Tag Read register is as follows:
FIGURE 15-14 I-/D-MMU TLB Tag Read Registers
Bits   Field
63:13  VA
12:0   Context
I/D VA: is the 51-bit virtual page number. Page offset bits for larger page sizes are stored in the TLB and returned for a Tag Read register read, but ignored during normal translation; that is, VA[15:13], VA[18:13], and VA[21:13] for 64 kB, 512 kB and 4 MB pages, respectively. Note that this field is sign-extended based on VA[43].
I/D Context: is the 13-bit context identifier.
An ASI store to the TLB Data Access register initiates an internal atomic write to the specified TLB Entry. The TLB entry data is obtained from the store data, and the TLB entry tag is obtained from the current contents of the TLB Tag Access register.
An ASI store to the TLB Data In register initiates an automatic atomic replacement of the TLB Entry pointed to by the current contents of the TLB Replacement register “Replace” field. The TLB data and tag are formed as in the case of an ASI store to the TLB Data Access register described above.
Caution – Stores to the Data In register are not guaranteed to replace the previous TLB entry causing a fault. In particular, to change an entry’s attribute bits, software must explicitly demap the old entry before writing the new entry; otherwise, a multiple match error condition can result.
An ASI load from the TLB Data Access register initiates an internal read of the data portion of the specified TLB entry. An ASI load from the TLB Tag Read register initiates an internal read of the tag portion of the specified TLB entry. ASI loads from the TLB Data In register are not supported.
15.9.10
I-/D-MMU Demap
Demap is an MMU operation, as opposed to a register operation as described above. The purpose of Demap is to remove zero, one, or more entries in the TLB. Two types of Demap operation are provided: Demap page, and Demap context. Demap page removes zero or one TLB entry that matches exactly the specified virtual page number. Demap page may in fact remove more than one TLB entry in the condition of a multiple TLB match, but this is an error condition of the TLB and has undefined results. Demap context removes zero, one, or many TLB entries that match the specified context identifier.
Demap is initiated by a STXA with ASI=57₁₆ for I-MMU demap or 5F₁₆ for D-MMU demap. It removes TLB entries from an on-chip TLB. UltraSPARC-IIi does not support bus-based demap. FIGURE 15-15 shows the Demap format:
FIGURE 15-15 MMU Demap Operation Format
Address:
Bits   Field
63:13  VA
12:7   ignored
6      Type
5:4    Context
3:0    0000
Data:
Bits   Field
63:0   (ignored)
VA: The virtual page number of the TTE to be removed from the TLB. This field is not used by the MMU for the Demap Context operation, but must be in range. The virtual address for demap is checked for out-of-range violations, in the same manner as any normal MMU access.
Type: The type of demap operation, as described in TABLE 15-16.
TABLE 15-16 MMU Demap Operation Type Field Description
Type Field   Demap Operation
0            Demap Page
1            Demap Context
Context ID: Context register selection, as described in TABLE 15-17. Use of the reserved value causes the demap to be ignored.
TABLE 15-17 MMU Demap Operation Context Field Description
Context ID Field   Context Used in Demap
00                 Primary
01                 Secondary
10                 Nucleus
11                 Reserved
Ignored: This field is ignored by hardware. (The common case is for the demap address and data to be identical.)
A demap operation does not invalidate the TSB in memory. It is the responsibility of the software to modify the appropriate TTEs in the TSB before initiating any Demap operation.
Note – A STXA to the data demap registers requires either a MEMBAR #Sync, FLUSH, DONE, or RETRY before the point that the effect must be visible to data accesses. A STXA to the I-MMU demap registers requires a FLUSH, DONE, or RETRY before the point that the effect must be visible to instruction accesses; that is, MEMBAR #Sync is not sufficient. In either case, one of these instructions must be executed before the next translating or bypass store or load of any type. This action is necessary to avoid corrupting data.
The demap operation does not depend on the value of any entry’s lock bit; that is, a demap operation demaps locked entries just as it demaps unlocked entries. The demap operation produces no output.
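The demap address of FIGURE 15-15 can be assembled from the page-aligned VA, the Type bit, and the Context field. The C sketch below only forms the 64-bit address; issuing the STXA and the required synchronizing instruction is left to privileged assembly code. The enum and function names are hypothetical.

```c
#include <stdint.h>

enum demap_type { DEMAP_PAGE = 0, DEMAP_CONTEXT = 1 };       /* TABLE 15-16 */
enum demap_ctx  { DEMAP_PRIMARY = 0, DEMAP_SECONDARY = 1,
                  DEMAP_NUCLEUS = 2 };                        /* TABLE 15-17 */

/* Form the STXA address for a demap: VA in bits 63:13, Type in bit 6,
 * Context in bits 5:4, and zero in the ignored bits 12:7 and in bits 3:0. */
static uint64_t demap_address(uint64_t va, enum demap_type type,
                              enum demap_ctx ctx)
{
    return (va & ~UINT64_C(0x1fff))
         | ((uint64_t)type << 6)
         | ((uint64_t)ctx << 4);
}
```

For a Demap Context operation the VA is not used for matching, but it must still be in range, so passing 0 is a reasonable choice.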
15.9.11
I-/D-Demap Page (Type=0)
Demap Page removes the TTE (from the specified TLB) matching the specified virtual page number and context register. The match condition with regard to the global bit is the same as a normal TLB access; that is, if the global bit is set, the contexts need not match. Virtual page offset bits [15:13], [18:13], and [21:13], for 64 kB, 512 kB, and 4 MB page TLB entries, respectively, are stored in the TLB, but do not participate in the match for that entry. This is the same condition as for a translation match.
Note – Each Demap Page operation removes only one TLB entry. A demap of a
64 kB, 512 kB, or 4 MB page does not demap any smaller page within the specified virtual address range.
15.9.12
I-/D-Demap Context (Type=1)
Demap Context removes all TTEs having the specified context from the specified TLB. If the TTE Global bit is set, the TTE is not removed.
15.10
MMU Bypass Mode
In a bypass access, the D-MMU sets the physical address equal to the truncated virtual address; that is, PA[40:0]=VA[40:0]. The physical page attribute bits are set as shown in TABLE 15-18.
TABLE 15-18 Physical Page Attribute Bits for MMU Bypass Mode
ASI                                                                      CP  IE  CV  E  P  W  NFO  Size
ASI_PHYS_USE_EC, ASI_PHYS_USE_EC_LITTLE                                  1   0   0   0  0  1  0    8 KB
ASI_PHYS_BYPASS_EC_WITH_EBIT, ASI_PHYS_BYPASS_EC_WITH_EBIT_LITTLE        0   0   0   1  0  1  0    8 KB
Bypass applies to the I-MMU only when it is disabled. See Section 15.7, “MMU Behavior During Reset, MMU Disable, and RED_state” on page 218 for details on the use of bypass when either MMU is disabled.
Compatibility Note – In UltraSPARC-IIi the virtual address is longer than the physical address; thus, there is no need to use multiple ASIs to fill in the high-order physical address bits, as is done in SPARC-V8 machines.
15.11
TLB Hardware
15.11.1
TLB Operations
The TLB supports exactly one of the following operations per clock cycle:
• Normal translation. The TLB receives a virtual address and a context identifier as input and produces a physical address and page attributes as output.
• Bypass. The TLB receives a virtual address as input and produces a physical address equal to the truncated virtual address, and page attributes, as output.
• Demap operation. The TLB receives a virtual address and a context identifier as input and sets the Valid bit to zero for any entry matching the demap page or demap context criteria. This operation produces no output.
• Read operation. The TLB reads either the CAM or RAM portion of the specified entry. (Since the TLB entry is greater than 64 bits, the CAM and RAM portions must be returned in separate reads. See Section 15.9.9, “I-/D-TLB Data-In/Data-Access/Tag-Read Registers” on page 229.)
• Write operation. The TLB simultaneously writes the CAM and RAM portion of the specified entry, or the entry given by the replacement policy described in Section 15.11.2.
• No operation. The TLB performs no operation.
15.11.2
TLB Replacement Policy
UltraSPARC-IIi uses a 1-bit LRU scheme, very similar to that used in SuperSPARC. Each TLB entry has an associated “valid,” “used,” and “lock” bit. On an automatic write to the TLB initiated through an ASI store to register TLB Data In, the TLB picks the entry to write based on the following rules:
1. The first invalid entry will be replaced (measuring from TLB entry 0). If there is no invalid entry, then:
2. The first unused entry with its lock bit set to zero will be replaced (measuring from TLB entry 0). If no unused entry has its lock bit set to zero, then:
3. All used bits are reset, and the process is repeated from Step 2 above.
Arbitrary entries may have their lock bit set; however, operation of the TLB is undefined if all entries have their lock bit set.
Due to the implementation of the UltraSPARC-IIi pipeline, the MMU can and will set a TLB entry’s used bit as if the entry were hit when the load or store is an annulled or mispredicted instruction. This can be considered to cause a very slight performance degradation in the replacement algorithm, although it may also be argued that it is desirable to keep these extra entries in the TLB.
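The three replacement rules can be modeled in software as follows. This C sketch is a behavioral model only, not a description of the hardware implementation; it assumes a 64-entry TLB (entries 0 .. 63, as in the Data Access address format), and the type and function names are hypothetical. Since the hardware behavior is undefined when every entry is locked, the model returns -1 in that case.

```c
#include <stdbool.h>

#define TLB_ENTRIES 64

typedef struct {
    bool valid;
    bool used;
    bool lock;
} tlb_entry;

/* Return the index of the entry an automatic (Data In) write would replace,
 * or -1 if every entry is valid and locked. May clear the used bits (rule 3). */
static int tlb_pick_victim(tlb_entry tlb[TLB_ENTRIES])
{
    int i, pass;

    /* Rule 1: first invalid entry, counting from entry 0. */
    for (i = 0; i < TLB_ENTRIES; i++)
        if (!tlb[i].valid)
            return i;

    for (pass = 0; pass < 2; pass++) {
        /* Rule 2: first unused entry whose lock bit is zero. */
        for (i = 0; i < TLB_ENTRIES; i++)
            if (!tlb[i].used && !tlb[i].lock)
                return i;
        /* Rule 3: reset all used bits, then repeat rule 2. */
        for (i = 0; i < TLB_ENTRIES; i++)
            tlb[i].used = false;
    }
    return -1;
}
```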
15.11.3
TSB Pointer Logic Hardware Description
The hardware diagram in FIGURE 15-16 on page 236 and the code fragment in CODE EXAMPLE 15-1 on page 237 describe the generation of the 8 kB and 64 kB pointers in more detail.
FIGURE 15-16 Formation of TSB Pointers for 8 kB and 64 kB TTEs
(Diagram. Inputs: TSB_Base, TSB_Split, TSB_Size, the 8 kB and 64 kB portions of the VA, and the 64k_not8k select. Output: the 64-bit Pointer, whose four low-order bits are 0000. TSB Size Logic for bit N (0 ≤ N ≤ 7): the bit is 64k_not8k when (N=TSB_Size)&&TSB_Split, otherwise it comes from TSB_Base when N ≥ TSB_Size, and otherwise from the VA portion selected by 64k_not8k.)
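The pointer formation summarized in FIGURE 15-16 can be sketched in C as follows. This is an illustrative sketch, not Sun's reference code: the shift amounts and the handling of the split bit are inferred from the Split and TSB_Size descriptions in Section 15.9.6 and from the figure, the 16-byte TTE size follows from 512 entries occupying 8 kB, and the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { PTR_8K, PTR_64K } pointer_type;

/* Form the 8 kB or 64 kB TSB pointer from the missing VA and the
 * TSB register fields (tsb_base is the register value with bits 12:0 clear). */
static uint64_t generate_tsb_pointer(uint64_t va, pointer_type type,
                                     uint64_t tsb_base, bool split,
                                     unsigned tsb_size)
{
    /* Bits taken from TSB_Base: everything above the TSB index
     * (one extra index bit when the TSB is split). */
    uint64_t base_mask = UINT64_C(0xffffffffffffe000)
                         << (split ? tsb_size + 1 : tsb_size);

    /* Move the virtual page number down to the 16-byte TTE index
     * position and clear the low four bits. */
    uint64_t va_portion = (va >> (type == PTR_8K ? 9 : 12))
                          & UINT64_C(0xfffffffffffffff0);

    if (split) {
        /* One bit selects the 8 kB (lower) or 64 kB (upper) half. */
        uint64_t split_mask = UINT64_C(1) << (13 + tsb_size);
        if (type == PTR_8K)
            va_portion &= ~split_mask;
        else
            va_portion |= split_mask;
    }
    return (tsb_base & base_mask) | (va_portion & ~base_mask);
}
```

With a common TSB of size 0 based at 100000₁₆, the 8 kB pointer for VA 2000₁₆ and the 64 kB pointer for VA 10000₁₆ are both 100010₁₆; with Split=1 the 64 kB pointer moves to 102010₁₆, in the upper half.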
CODE EXAMPLE 15-1 Pseudo-code for UltraSPARC-IIi D-MMU Pointer Logic
int64 GenerateTSBPointer(
    int64 va,
    PointerType type,
    int64 TSBBase,
    Boolean split,
    int TSBSize)
{
    int64 vaPortion;
    int64 TSBBaseMask;
    int64 splitMask;

    // TSBBaseMask marks the bits from TSB Base Reg
    TSBBaseMask = 0xffffffffffffe000 // TSB Register

    // Shift va towards lsb appropriately and
    // zero out the original va page offset
    vaPortion = (va >> ((type == 8K_POINTER)? 9: 12)) &
        0xfffffffffffffff0;

    if (split) {
        // There’s only one bit in question for split
        splitMask = 1
: E-cache Tag Parity: E-cache state[1:0] & E-cache Tag
UltraSPARC-IIi is normally enabled to trap if it detects an E-cache tag parity error.
16.4.2
E-cache Data Parity Error
The E-cache data bus connects the UltraSPARC-IIi processor and E-cache data SRAM. The 64-bit wide data bus is protected by byte parity. Parity check failures on this bus can be caused by faulty devices or interconnects. UltraSPARC-IIi performs parity checking during:
1. Processor reads from E-cache
2. Reads due to snooping (copyback) and victimization (writeback)
A parity error detected during an E-cache data access can cause UltraSPARC-IIi to trap. An E-cache data parity error detected during an instruction access causes an instruction_access_error deferred trap. An E-cache parity error detected during a data read access causes a data_access_error deferred trap. When multiple errors occur, the trap type corresponds to the first detected error.
If an E-cache data parity error occurs while snooping, a bad ECC error is generated and sent to the requester. This causes an instruction/data_access_error trap at the master that requested the data. The slave processor logs error information that can be read by the master during error handling. The processor being snooped is not interrupted by this error condition.
Compatibility Note – If an E-cache data parity error occurs during a write-back,
uncorrectable ECC is not forced to memory. However, the error information is logged in the AFSR and a disrupting data_access_error trap is generated.
16.4.3
DRAM ECC Error
UltraSPARC-IIi supports ECC generation and checking for all accesses to and from the DRAM. Correctable errors (CE) are fixed and the data transfer continues. Uncorrectable ECC errors on cache fills are reported for any ECC error in the cache block, not just for the referenced word. An uncorrectable error detected during an instruction access causes an instruction_access_error deferred trap. An uncorrectable error detected during a data access causes a data_access_error deferred trap. When multiple errors occur, the trap type corresponds to the first detected error.
16.4.4
CE/UE
If the Memory Control Unit detects a CE, data is corrected before it is used. This is done in these cases:
• PCI DMA reads from memory
• PCI DMA partial line writes to memory
DMA ECC errors are reported to the processor via interrupt as long as ECC checking and ECC interrupt are both enabled. Error information is logged in the DMA UE or CE AFSR/AFAR. Processor UEs and CEs are reported via trap, and are separately maskable.
16.4.5
Timeout
An attempted read of an unsupported or nonexistent device results in a timeout (TO). For example, a TO results from a read of a PCI bus address unmapped to a PCI device. Writes to non-mapped PCI addresses are reported via a late interrupt.
16.4.6
PCI Timeout
A timeout is sent (TO in Section 16.6.2, “ECU Asynchronous Fault Status Register” on page 251) to the UltraSPARC-IIi core under a variety of PIO read error cases. If no device is mapped (or responds) to the PCI address, the transaction is terminated with a master-abort and the UltraSPARC-IIi RMA Status bit is set. If a device terminates a PIO read with too many retries (disconnect with no data transfer), UltraSPARC-IIi stops retrying the access and causes a TO. A maximum of 512 retries (according to the contents of the PCI Configuration Space Retry Limit Counter Register) are allowed, although this limit can be disabled.
PCI has no timeout mechanism analogous to the S-Bus timeout. However, the PCI specification does recommend that all targets issue a retry when more than 16 PCI clocks will be consumed waiting for the first data transfer. When a device claims the transaction but never signals that it is ready to transfer data, the system hangs. This situation only occurs because of a device hardware error.
16.4.7
PCI Data Parity Error
PCI requires all devices to generate parity for the address/data and cmd/byte enable busses. A single even parity bit is used for 32 bits of address/data and the 4-bit cmd/byte enable bus. This section covers only parity errors on data phases; address parity errors are covered in “PCI Address Parity Error” on page 247.
Reporting of parity errors may be disabled using the PER bit described in Section 19.3.1.3, “PCI Configuration Space Command Register” on page 303. Setting PER enables UltraSPARC-IIi to report PIO data parity errors to the processor and DMA data parity errors to the bus master. When a data parity error is detected or signalled, UltraSPARC-IIi does not terminate the transaction prematurely but attempts to take it to completion.
If PER is enabled, a parity error detected on PIO read is reported with a BERR to the UltraSPARC-IIi core, along with setting the DPE and DPD bits described in Section 19.3.1.4, “PCI Configuration Space Status Register” on page 303. The PCI signal ‘PERR#’ is also asserted.
Compatibility Note – If PER is disabled, UltraSPARC-IIi does not set DPE if it
detects a parity error on PIO reads. This is inconsistent with the PCI 2.1 spec.
A parity error signalled via PERR# on a PIO write is logged if PER is enabled. In this case the DPD bit in the PCI Configuration Space Status Register and the P_PERR/S_PERR bits in the PCI PIO Write AFSR are set, the PCI PIO Write AFAR is loaded with the PIO address, and an interrupt is generated.
A parity error detected during a DMA write is logged if PER is enabled. The DPE bit in the PCI Configuration Space Status Register is set, and PERR# is asserted to the bus master. Subsequent action taken by the master is device dependent.
Compatibility Note – If PER is disabled, UltraSPARC-IIi does not set DPE if it detects a parity error on DMA writes. This is inconsistent with the PCI 2.1 spec.
Data parity is not checked during DMA reads. Also, since UltraSPARC-IIi is not the bus master, PERR# is ignored. Note, however, that parity includes CBE#, which is driven to UltraSPARC-IIi and is part of the parity bit generation. It is an interesting part of the protocol that parity includes bits (CBE#/AD) driven by two different parties. If only the CBE# received by UltraSPARC-IIi is wrong for a DMA read, the parity error goes unreported.
16.4.8
PCI Target-Abort
If an error occurs during an access of a PCI device, the device may terminate the transaction with a target-abort. Examples of causes of this result are unsupported byte enables, an address parity error, and device-specific errors. Any data that may have been transferred during the transaction before the target-abort occurred is corrupt and must not be used by the recipient. A PIO read terminated with a target-abort results in a Bus Error (BERR in Section 16.6.2, “ECU Asynchronous Fault Status Register” on page 251) to the UltraSPARC-IIi core and the RTA bit being set in the PCI Configuration Space Status Register. A PIO write that is terminated with a target-abort results in an asynchronous error. The P_TA/S_TA bit is set in the PCI PIO Write AFSR and the physical address loaded into the PCI PIO Write AFAR. The RTA bit in the PCI Configuration Space Status Register is also set for writes. UltraSPARC-IIi issues a target-abort upon detecting an address parity error, taking an IOMMU address translation error, and detecting a UE ECC error. The STA bit is set in the PCI Configuration Space Status Register but in all cases it is the responsibility of the bus master to report the error to system software (using SERR# or a device-specific interrupt).
16.4.9
DMA ECC Errors
The PCI DMA UE/CE AFSR/AFAR registers log DMA errors.
1. If UE interrupts are enabled, an interrupt is posted when UltraSPARC-IIi detects a UE.
2. A UE on any of the data for a DMA read (up to a 64 byte prefetch if from memory) causes a target-abort to the PCI master device as soon as possible. This may be before the DMA read operation reaches the data transfer cycle with the UE data.
3. During DMA writes of less than 16 bytes, good data and check bits are provided for all 16 bytes when completing a Read-Modify-Write to memory. If a DMA transaction does not overwrite, or only partially overwrites, the UE data, note that bad data may then appear as good in memory.
16.4.10
IOMMU Translation Error
The IOMMU translates the PCI DMA address to a physical page address and checks for access violations. The IOMMU can detect the “access to an invalid page” and “access with protection violation” errors. An invalid error occurs when the DMA page address lacks a valid physical page mapped to it. A protection error occurs when the PCI master attempts to write to a page that is marked as read-only. Both errors are reported with a target-abort to the device.
Compatibility Note – A new feature for UltraSPARC-IIi is that the VA of the offending DMA access is logged in the PCI DMA UE AFSR and AFAR, with a bit set for identification as a DMA translation error.
Additional reporting of translation errors by the initiating PCI master is device dependent.
16.4.11
PCI Address Parity Error
PCI Address parity errors may be reported during PIO operations and detected or reported during DMA transfers. The PCI mechanism for reporting address parity errors is the “System Error”. Address parity error reporting can be disabled (together with all parity error reporting) using the PER PCI Configuration Space Command Register bit.
After detecting a DMA address parity error, UltraSPARC-IIi first sets the DPE bit in the PCI Configuration Space Status Register. If PER is enabled, it then issues a target-abort to the master, and generates a PCI Error interrupt with the PCI_SERR bit in the PCI Control and Status Register set. If both PER and SERR_EN are enabled in the PCI Configuration Space Command Register, UltraSPARC-IIi also asserts SERR# on the bus and sets the SSE bit in the PCI Configuration Space Status Register.
When a PIO address parity error is reported by a device via a SERR# assertion, UltraSPARC-IIi reports the system error as described in “PCI System Error” on page 248. Upon detecting the address parity error, the target device has the following options:
1. Not claiming the transaction, causing a TO trap to the UltraSPARC-IIi core
2. Issuing a target-abort, resulting in a BERR trap to the UltraSPARC-IIi core for reads and an asynchronous error interrupt for writes
3. Completing the cycle as if there were no error and either generating a system error or an interrupt at some later time
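The DMA address parity sequence described at the start of this section depends only on the PER and SERR_EN enable bits, so it can be summarized as a small decision function. The C sketch below is a behavioral summary for software authors, not a register-level model; the structure and function names are hypothetical, and no register bit positions are implied.

```c
#include <stdbool.h>

/* Actions taken by UltraSPARC-IIi on a DMA address parity error. */
typedef struct {
    bool set_dpe;             /* DPE in PCI Configuration Space Status    */
    bool target_abort;        /* target-abort issued to the master        */
    bool pci_error_interrupt; /* PCI Error interrupt with PCI_SERR set    */
    bool assert_serr;         /* SERR# driven on the bus                  */
    bool set_sse;             /* SSE in PCI Configuration Space Status    */
} dma_addr_parity_response;

static dma_addr_parity_response dma_addr_parity_error(bool per, bool serr_en)
{
    dma_addr_parity_response r;
    r.set_dpe = true;                 /* always set first              */
    r.target_abort = per;             /* only if PER is enabled        */
    r.pci_error_interrupt = per;
    r.assert_serr = per && serr_en;   /* needs both PER and SERR_EN    */
    r.set_sse = per && serr_en;
    return r;
}
```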
16.4.12
PCI System Error
The PCI System Error (PCI bus SERR# assertion) may occur on address parity errors as well as on device specific fatal errors. The assertion of SERR# can be disabled by the SERR_EN PCI Configuration Space Command Register bit. Any PCI device may assert SERR# at any time but only UltraSPARC-IIi can detect and report it to system software. SERR# assertion causes a PCI Error Interrupt and sets the PCI_SERR bit in the PCI Control and Status Register. Devices that assert SERR# must set their SSE Status register bit. Multiple system errors generated before the system software clears the PCI CSR do not cause additional interrupts, so it is important that software check all device PCI Configuration Space Status registers.
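Because a single interrupt may stand for several SERR# assertions, a handler would typically walk every device's Status register. A minimal sketch follows, assuming the Status values have already been read from configuration space into an array (the read mechanism is platform specific and not shown) and assuming SSE sits at the standard PCI position, bit 14. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_STAT_SSE (1u << 14)  /* signaled system error (assumed standard PCI layout) */

/* Counts the devices whose Status snapshot has SSE set and records up to
 * max_hits of their indices in hits[]. */
static size_t find_serr_sources(const uint16_t *status, size_t ndev,
                                size_t *hits, size_t max_hits)
{
    size_t count = 0;

    for (size_t i = 0; i < ndev; i++) {
        if (status[i] & PCI_STAT_SSE) {
            if (count < max_hits)
                hits[count] = i;
            count++;
        }
    }
    return count;
}
```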
248
UltraSPARC-IIi User’s Manual • October 1997
16.5
Summary of Error Reporting
Register abbreviations are: PCI CSR for the PCI Control/Status Register, and PCI Status for the PCI Configuration space Status register. AFR indicates both an AFSR and an AFAR.
TABLE 16-1 Summary of Error Reporting
Transaction | Error Type | CPU Response | Error Register(s) | PCI Bus
Fetch, LD/ST, PCI DMA, Writeback | E$ Tag/Data RAM Parity Error | ETP/EDP/WP/CP (ECU AFSR), Trap | ECU AFRs | -
PIO Read | Data Parity | BERR (ECU AFSR), Trap | PCI CSR, PCI Status, ECU AFRs | Complete Transaction
PIO Read | Master-abort | TO (ECU AFSR), Trap | PCI Status, ECU AFRs | Master-abort
PIO Read | Target-abort | BERR (ECU AFSR), Trap | PCI Status, ECU AFRs | Target-abort
PIO Read | Retry Limit | TO (ECU AFSR), Trap | PCI Status, ECU AFRs | Cease Retries
PIO Write | Master-abort | PCI Error Interrupt | PCI PIO Write AFRs, PCI Status | Master-abort
PIO Write | Target-abort | PCI Error Interrupt | PCI PIO AFRs, PCI Status | Target-abort
PIO Write | Retry Limit | PCI Error Interrupt | PCI PIO AFRs | Cease Retries
PIO Write | Data Parity | PCI Error Interrupt | PCI PIO AFRs, PCI Status | Complete Transaction
Any PIO | Address Parity Error | | | Device dependent
DMA Read | UE-ECC | PCI UE Interrupt | PCI DMA UE AFRs, PCI Status | Target-abort
DMA Read | CE-ECC | PCI CE Interrupt | PCI DMA CE AFRs | Complete Transaction
DMA Read | Ecache Data Parity | CP (ECU AFSR), Trap | ECU AFSR | Complete Transaction
DMA Write | UE-ECC (1) | PCI UE Interrupt | PCI DMA UE AFRs | Complete Transaction
DMA Write | CE-ECC | PCI CE Interrupt | PCI DMA CE AFRs | Complete Transaction
DMA Write | Data Parity | - | PCI Status | Complete Transaction, PERR#
Any DMA | Address Parity | PCI Error Interrupt | PCI Status | Target-abort
Any DMA | Translation Error | PCI UE Interrupt | PCI Status, PCI DMA UE AFRs, IOMMU Control Reg | Target-abort
PCI System Error | SERR# assertion | PCI Error Interrupt | PCI CSR, PCI Status | -
1. Less than 16-byte aligned write to DRAM only
Unreported Errors
Some error conditions are not reported by the system. The following list gives examples of these errors:
• A write to a non-supported address.
• A write to a read-only register in UltraSPARC-IIi is ignored.
• A non-cached write to memory.
• A read from a write-only register in UltraSPARC-IIi returns unknown data.
This list may not be exhaustive.
16.6
E-cache Unit (ECU) Error Registers
Note – MEMBAR #Sync is generally needed after stores to error ASI registers.
16.6.1
E-cache Error Enable Register
Name: ASI_ESTATE_ERROR_EN_REG
ASI_ESTATE_ERROR_EN_REG: ASI== 0x4B, VA==0x0
TABLE 16-2 E-cache Error Enable Register Format
Field | Use | Reset | RW
Reserved | - | 0 | R0
EPEN | Trap on ETP, EDP, WP, CP | 0 | RW
UEEN | Trap on UE | 0 | RW
Reserved | | 0 | RW
NCEEN | Trap on TO, BERR, ETP, EDP, WP, CP, UE | 0 | RW
CEEN | Trap on correctable memory read error | 0 | RW
EPEN: Additional enable on ETP and EDP errors. See NCEEN.
UEEN: Additional enable on UE errors. See NCEEN.
NCEEN: If set, an uncorrectable error, time-out, bus error, SDB or E-cache data parity error causes an {instruction, data}_access_error trap and an E-cache tag parity error should cause a system fatal error; otherwise, the error is logged in the AFSR and ignored.
CEEN: If set, a correctable error detected during a memory read access causes a correctable_ECC_error disrupting trap; otherwise, the error is logged in the AFSR and ignored.
Examples:
• Disable all traps: [4:0] = xxx00
• Disable SRAM parity, Disable ECC, Enable Bus traps: [4:0] = 00x10
• Disable SRAM parity, Enable ECC, Enable Bus traps: [4:0] = 01x11
• Enable SRAM parity, Enable ECC, Enable Bus traps: [4:0] = 11x11
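The example patterns can be generated by a small helper. This is a sketch under stated assumptions: the bit positions (EPEN = 4, UEEN = 3, NCEEN = 1, CEEN = 0) are inferred from the order of the fields and the [4:0] patterns in the examples, the "x" position (bit 2) is written as 0, and the function name is hypothetical. Writing the value to the register itself requires a privileged store to ASI 0x4B, which is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions inferred from the [4:0] example patterns (assumption). */
#define ESTATE_CEEN   (1u << 0)
#define ESTATE_NCEEN  (1u << 1)
#define ESTATE_UEEN   (1u << 3)
#define ESTATE_EPEN   (1u << 4)

/* Builds an enable value from the three switches used in the examples:
 * SRAM parity traps, ECC traps, and bus (TO/BERR) traps. */
static uint64_t estate_error_en(bool sram_parity, bool ecc, bool bus)
{
    uint64_t v = 0;

    if (bus)
        v |= ESTATE_NCEEN;
    if (ecc)
        v |= ESTATE_UEEN | ESTATE_CEEN;
    if (sram_parity)
        v |= ESTATE_EPEN;
    return v;
}
```

Note that EPEN and UEEN are described as additional enables, so setting them without NCEEN would not be expected to produce traps.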
16.6.2
ECU Asynchronous Fault Status Register
The Asynchronous Fault Status Register (AFSR) logs all errors that occurred since its fields were last cleared. The AFSR is updated according to the policy described in “Overwrite Policy” on page 258. The AFSR is logically divided into four fields:
• The accumulating multiple-error (ME) bit is set when multiple errors with the same sticky error bit have occurred, except for correctable errors. Multiple errors of different types are indicated by setting more than one of the sticky error bits.
• The accumulating privilege-error (PRIV) bit is set when an error occurs from an access generated by code executing with PSTATE.PRIV = 1. If this bit is set, system state has been corrupted.
• The sticky error bits record the most recently detected errors. These sticky bits accumulate errors detected since the last write that cleared this register.
• The ETS and P_SYND fields contain the tag and data parity syndromes, respectively. Syndrome bits are endian-neutral, that is, bit 0 corresponds to bits of the E-cache data bus (i.e. bytes whose least significant four address bits are 0xf). The syndrome fields have the status of the first occurrence of the highest priority error related to that field. If no status bit is set that corresponds to that field, the contents of the syndrome field will be zero.
The AFSR must be explicitly cleared by software; it is not cleared automatically during a read. Writes to the AFSR sticky bits with particular bits set clear the corresponding bits in the AFSR. Bits associated with disrupting traps must be cleared before re-enabling interrupts to prevent multiple traps for the same error. Writes to the AFSR sticky bits with particular bits clear will not affect the corresponding bits in the AFSR. If software attempts to clear error bits at the same time as an error occurs, the clear will be performed before logging the new error status. The syndrome field is read only and writes to this field are ignored.
Name: ASI_ASYNC_FAULT_STATUS
ASI_ASYNC_FAULT_STATUS: ASI== 0x4C, VA==0x0
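The write-one-to-clear behavior described above can be modeled in a few lines. This is a sketch, not a register access routine: the function names are hypothetical, and the mask of sticky bits is passed in as a parameter because the bit positions are not restated here.

```c
#include <stdint.h>

/* Register value after software writes `written` to a write-1-to-clear
 * register. Only bits inside `w1c_mask` can be cleared; zeros written
 * leave the corresponding bits unchanged. */
static uint64_t w1c_write(uint64_t current, uint64_t written, uint64_t w1c_mask)
{
    return current & ~(written & w1c_mask);
}

/* A clear that coincides with a new error: the clear is applied first,
 * then the new error status is logged, so the new error is not lost. */
static uint64_t w1c_write_with_new_error(uint64_t current, uint64_t written,
                                         uint64_t w1c_mask, uint64_t new_errors)
{
    return w1c_write(current, written, w1c_mask) | new_errors;
}
```

A common pattern that follows from these semantics is to read the register and write the same value back, which clears exactly the bits that were observed.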
TABLE 16-3 Asynchronous Fault Status Register
Field | Use | Reset | RW
Reserved | - | 0 | R
ME | Multiple Error of same type occurred | 0 | RW1C
PRIV | Privileged code access error(s) has occurred | 0 | RW1C
Reserved | Read as 0 | 0 | R0
ETP | Parity error in E-cache Tag SRAM | 0 | RW1C
Reserved | Read as 0 | 0 | R0
TO | Time-Out from PCI PIO load or Inst. fetch | 0 | RW1C
BERR | Bus Error from PCI PIO load or Inst. fetch | 0 | RW1C
Reserved | Read as 0 | 0 | R0
CP | PCI DMA E-cache Parity error | 0 | RW1C
WP | Data parity error from E-cache SRAMs for Writeback (victim) | 0 | RW1C
EDP | Data parity error from E-cache SRAMs | 0 | RW1C
UE | Uncorrectable ECC error (E_SYND in SDB registers) | 0 | RW1C
CE | Correctable memory read ECC error (E_SYND in SDB registers) | 0 | RW1C
Reserved | Read as 0 | 0 | R0
ETS | E-cache Tag parity Syndrome | 0 | R
Reserved | Read as 0 | 0 | R0
P_SYND | Parity Syndrome | 0 | R
TABLE 16-4 E-cache Data Parity Syndrome Bit Orderings
Byte address | E-cache data bus bits | Syndrome Bit
0x7 | | 0
0x6 | | 1
0x5 | | 2
0x4 | | 3
0x3 | | 4
0x2 | | 5
0x1 | | 6
0x0 | | 7
Always 0 | | 15:8
TABLE 16-5 E-cache Tag Parity Syndrome Bit Orderings
Syndrome Bit | E-cache Tag bus bits
0 |
1 |
3:2 | Always 0
16.6.3
ECU Asynchronous Fault Address Register
This register is valid when one of the Asynchronous Fault Status Register (AFSR) error status bits that capture address is set (for example, for correctable or uncorrectable memory ECC error, bus time-out or bus error). The address corresponds to the first occurrence of the highest priority error in AFSR that captures address (see “AFAR Overwrite Policy” on page 258). Address capture is reenabled by clearing all corresponding error bits in AFSR. If software attempts to write to these bits at the same time as an error that captures address occurs, the error address is stored.
Name: ASI_ASYNC_FAULT_ADDRESS
ASI_ASYNC_FAULT_ADDRESS: ASI== 0x4D, VA==0x0
TABLE 16-6 Asynchronous Fault Address Register
Field | Use | RW
Reserved | - | R0
PA | Physical address of faulting transaction | RW
Reserved | - | R0
PA: Address information for the most recently captured error
TABLE 16-7 Error Detection and Reporting in AFAR and AFSR
Error Type | PA | SYNDROME | Trap | PRIV captured? | Trap Type | Updated status | SW Cache flush
Uncorrectable ECC | Y | E_SYND (3) | Deferred | Y | I (4), D | UE | Yes if cacheable
Correctable ECC | Y | E_SYND | Disrupting | N | C | CE | No
E$ parity: UltraSPARC-IIi LD/Fetch | N (2) | P_SYND | Deferred | Y | I, D | EDP | Yes
E$ parity: writeback | N | P_SYND | Disrupting | N | D | WP | No
E$ parity: DMA read | N | P_SYND | Disrupting | N | D | CP | No
Bus Error (1) | Y | - | Deferred | Y | I, D | BERR | Never for Cacheable
Time-out | Y | - | Deferred | Y | I, D | TO | Never for Cacheable
Tag parity | N | ETS | Deferred | N | I, D | ETP | power on clear
1. PCI transactions can cause Bus Error and Time-out. See Section 16.5, “Summary of Error Reporting” on page 249.
2. No address captured on parity errors.
3. E_SYND is ECC syndrome; P_SYND is parity syndrome; ETS is E-cache Tag Parity Syndrome
4. I is instruction_access_error trap; D is data_access_error trap; C is corrected_ECC_error trap; POR is power-on reset trap
Compatibility Note – UltraSPARC-IIi does not issue a target-abort on a parity error
resulting from a DMA read of E-cache. UltraSPARC caused a UE at the receiver of the data. Currently it is only reported with the same priority/trap as WP (but CP bit set).
Compatibility Note – UltraSPARC-IIi causes a Deferred Trap similarly to
UltraSPARC for ETS, without a system reset. Software can determine if a system reset is necessary.
16.6.4
SDBH Error Register
Compatibility Note – The SDB name is inherited from UltraSPARC. It logs information about memory errors caused by the CPU core. Only the SDBH register is used. Current Solaris software checks whether SDBL is non-zero, and ORs in a 1 to the logged pa[3] (which is always zero on UltraSPARC, but valid on UltraSPARC-IIi).
For implementation efficiency, the UltraSPARC Data Buffer (SDB) error and control registers were physically separated into upper half and lower half registers. Separate ASIs are used for reading (0x7F) and writing (0x77) the SDB registers. If software attempts to clear these bits at the same time as an error occurs, the appropriate error bit is set to avoid losing error information.
On UltraSPARC-IIi, writes to SDBL registers have no effect, and reads of SDBL registers always return zeros.
Name: ASI_SDBH_ERROR_REG_WRITE
ASI 0x77, VA==0x0
Name: ASI_SDBH_ERROR_REG_READ
ASI 0x7F, VA==0x0
TABLE 16-8 SDBH Error Register Format
Field | Use | Reset | RW
Reserved | - | 0 | R0
UE | If set, UE has occurred | 0 | RW1C
CE | If set, CE has occurred | 0 | RW1C
E_SYNDR | ECC syndrome from system. | - | R
E_SYNDR: ECC syndrome for correctable error from system. In case of multiple outstanding errors, only the first is recorded.
The UE and CE bits are sticky error bits that record the most recently detected errors. These bits accumulate errors detected since the last write that cleared this register.
The SDB error registers are not cleared automatically during a read. Writes to these registers with bit-8 or bit-9 set clear the corresponding bits in the error register. Writes to the error register with particular bits clear will not affect the corresponding bits in the error register. The syndrome field is read only and writes to this field are ignored.
Note – A recorded correctable error may be overwritten by an uncorrectable error.
16.6.5
SDBL Error Register
Name: ASI_SDBL_ERROR_REG_WRITE
ASI 0x77, VA==0x18
Name: ASI_SDBL_ERROR_REG_READ
ASI 0x7F, VA==0x18
Writes have no effect, Reads return 0. This property allows existing US-I and US-II software to work without change.
16.6.6
SDBH Control Register
Name: ASI_SDBH_CONTROL_REG_WRITE
ASI 0x77, VA==0x20
Name: ASI_SDBH_CONTROL_REG_READ
ASI 0x7F, VA==0x20
TABLE 16-9 Bits
SDBH Control Register Format
Field Use Reset RW
Reserved Undefined VERSION F_MODE FCBV
— Reserved Always 0 Force ECC error Force check bit vector
0 0 0 0
R R R RW RW
VERSION: reads as 0 on UltraSPARC-IIi.
F_MODE: If set, the contents of the FCBV field are sent with the out-going transaction, instead of the generated ECC.
FCBV: Force check bit vector.
16.6.7
SDBL Control Register
Name: ASI_SDBL_CONTROL_REG_WRITE
ASI 0x77, VA==0x38
Name: ASI_SDBL_CONTROL_REG_READ
ASI 0x7F, VA==0x38
Writes have no effect, Reads return 0. This allows existing US-I and US-II software to work without change.
16.6.8
PCI Unit Error Registers
See Section 19.4.3, “DMA Error Registers” on page 330 and Section 19.3.0.2, “PCI PIO Write Asynchronous Fault Status/Address Registers” on page 295.
16.7
Overwrite Policy
This section describes the overwrite policy for error bits when multiple error conditions have occurred. Errors are captured in the order that they are detected, not necessarily in program order. If an error occurs while error bits are being cleared by software, the overwrite control includes the effect of the software clear. For example, if ETP were set (which blocks E-cache tag syndrome updates) and software clears the ETP bit at the same time as an E-cache tag parity error occurs, the E-cache tag syndrome is updated.
16.7.1
AFAR Overwrite Policy
The priority for AFAR updates is: UE > CE > {TO, BE}
The physical address of the first error within a class (UE, CE, {TO, BE}) is captured in the AFAR until the associated error status bit is cleared in AFSR, or an error from a higher priority class occurs. A CE error overwrites prior TO or BE errors. A UE error overwrites prior CE, TO and BE errors.
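The AFAR priority rule can be expressed as a small state model. This is a simplified sketch with hypothetical names: it tracks only the class of the error whose address is currently held, and it treats a software clear as releasing the held address entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Error classes in increasing AFAR priority: {TO, BE} < CE < UE. */
enum afar_class { AFAR_NONE = 0, AFAR_TO_BE = 1, AFAR_CE = 2, AFAR_UE = 3 };

struct afar_model {
    enum afar_class held;  /* class of the error whose address is held */
    uint64_t pa;           /* captured physical address */
};

/* Logs an error; returns true when its address replaces the held one.
 * A later error of the same or lower class never overwrites. */
static bool afar_log(struct afar_model *m, enum afar_class cls, uint64_t pa)
{
    if (cls > m->held) {
        m->held = cls;
        m->pa = pa;
        return true;
    }
    return false;
}

/* Software cleared the AFSR status bit for the held class. */
static void afar_release(struct afar_model *m)
{
    m->held = AFAR_NONE;
}
```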
16.7.2
AFSR Parity Syndrome (P_SYND) Overwrite Policy
Parity information for the first occurrence of any error is captured in the P_SYND field of the AFSR. Error logging is re-enabled by clearing the EDP, CP, and WP fields. Any set bits in these fields inhibit update to the P_SYND field.
16.7.3
AFSR E-cache Tag Parity (ETS) Overwrite Policy
Parity information for the first occurrence of any error is captured in the ETS field of the AFSR register. Error logging in this field can be re-enabled by clearing the ETP field.
16.7.4
SDB ECC Syndrome (E_SYND) Overwrite Policy
Priority for E_SYND updates is: UE > CE
The ECC syndrome of the first error within a class (UE, CE) is captured in the E_SYND field of the SDB Error Register until the associated error status bit is cleared in the SDB error register or an error from a higher priority class occurs. A UE error overwrites prior CE errors.
CHAPTER
17
Reset and RED_state
17.1
Overview
A reset is anything that causes an entry to RED_state. UltraSPARC-IIi system resets are generated either from signals sourced from the external system or from resets generated and observed only by the UltraSPARC-IIi core. In addition to forcing entry to RED_state, various resets cause different effects in initializing processor state.
The power supply, push-button, scan interface, software, error conditions, and power management logic can create externally sourced resets. Their signals are converted into power-on-reset (POR) or externally initiated reset (XIR) signals that pass to the core with different levels of effect on the system. Information from peripheral logic is stored in UltraSPARC-IIi’s Reset_Control register for software to determine the cause of the external reset.
Software-Initiated Reset (SIR) and Watchdog Reset (WDR) resets result from core conditions and are generated and observed only by the processor core.
Resets are used to force all or part of the system into a known state. UltraSPARC-IIi distributes the resets to all subsystems, including the UPA64S device and the primary PCI bus reset. If APB is present, it propagates this reset to the secondary PCI buses. Resets in general drive the processor into RED_state (described in Section 17.3, “RED_state”), with the exceptions described in that section.
FIGURE 17-1 Reset Block Diagram (blocks: Power Supply, Scan Interface, Pushbutton, RIC, UltraSPARC-IIi, UPA64S Graphic Device, EPD, E-cache SRAMs, APB; signals: POWER_OK, SCAN CONTROL, BUTTON_POR, BUTTON_XIR, SYS_RESET_L, P_RESET_L, X_RESET_L, Software Reset, RST_L, PCI_RESET_A, PCI_RESET_B)
The assertion of RST_L is asynchronous to UPA clock. PCI specifies an asynchronous, monotonic, deassertion for RST_L.
Note – Most existing UPA64S devices can tolerate an asynchronous deassertion of
UPA_RESET_L (the UPA spec says it should be a synchronous deassertion).
17.2
17.2
Resets
17.2.1
Power-on Reset (POR) and Initialization
A Power-on Reset occurs when the POR signal is asserted; it remains in effect until the CPU voltages reach their operating specifications and POR becomes inactive. When the POR pin is active, all other resets and traps are ignored. Power-on Reset has a trap type of 0x001 at physical address offset 0x20. Any pending external transactions are cancelled.
After a Power-on Reset, software must initialize values specified as unknown in Section 17.4, “Machine State after Reset and in RED_state”. In particular, the Valid and LRU bits in the I-cache (Section A.7, “I-cache Diagnostic Accesses” on page 387), the Valid bits in the D-cache (Section A.8, “D-cache Diagnostic Accesses” on page 392), and all E-cache tags and data (Section A.9, “E-cache Diagnostics Accesses” on page 394) must be cleared before enabling the caches. The iTLB and dTLB also must be initialized as described in Section 15.7, “MMU Behavior During Reset, MMU Disable, and RED_state” on page 218.
Reset priorities from highest to lowest are: POR, XIR, WDR, SIR. See the following sections for explanations of each reset.
Note – Each register must be initialized before it is used. For example, CWP must
be initialized before accessing any windowed registers, since the CWP register selects which register window to access. Failure to initialize registers or states properly prior to use may result in unpredictable or incorrect results.
17.2.2
Externally Initiated Reset (XIR)
An Externally Initiated Reset is sent to the CPU via the XIR pin; it causes a SPARC-V9 XIR, which has a trap type of 0x003 at physical address offset 0x60. It has higher priority than all other resets except POR. XIR is used for system debug.
17.2.3
Watchdog Reset (WDR) and error_state
A SPARC-V9 processor enters error_state when a trap occurs and TL = MAXTL. The processor signals itself internally to take a watchdog_reset (WDR) trap at physical address offset 0x40. This reset affects only one processor, rather than the entire system. CWP updates due to window traps that cause watchdog traps are the same as the no watchdog trap case.
17.2.4
Software-Initiated Reset (SIR)
A Software-Initiated Reset is invoked by a SIR instruction within the processor core. This processor reset has a trap type of 0x004 at physical address offset 0x80 and affects only the processor, not IO or the external system. A Signal Monitor (SIGM) instruction generates an SIR trap on the local processor.
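The reset priorities and vector offsets given in Sections 17.2.1 through 17.2.4 can be collected into a lookup, reading the offsets as hexadecimal. The sketch below is illustrative only; the enum encoding and function names are hypothetical and do not correspond to any hardware register.

```c
#include <stdint.h>

/* Flags for pending reset requests (hypothetical encoding). */
enum reset_kind {
    RESET_NONE = 0,
    RESET_SIR  = 1,
    RESET_WDR  = 2,
    RESET_XIR  = 4,
    RESET_POR  = 8
};

/* Highest-priority pending reset: POR > XIR > WDR > SIR. */
static enum reset_kind highest_priority_reset(unsigned pending)
{
    if (pending & RESET_POR) return RESET_POR;
    if (pending & RESET_XIR) return RESET_XIR;
    if (pending & RESET_WDR) return RESET_WDR;
    if (pending & RESET_SIR) return RESET_SIR;
    return RESET_NONE;
}

/* Physical address offset of the reset trap vector; 0 for RESET_NONE. */
static uint32_t reset_vector_offset(enum reset_kind kind)
{
    switch (kind) {
    case RESET_POR: return 0x20;
    case RESET_WDR: return 0x40;
    case RESET_XIR: return 0x60;
    case RESET_SIR: return 0x80;
    default:        return 0;
    }
}
```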
17.2.5
Hardware Reset Sources
The RIC chip detects five different resets: POWER_OK from the power supply, Push-button PowerOnReset, Push-button XIR, Scan PowerOnReset, and ScanXIR. The RIC chip combines the five reset conditions into three signals to the UltraSPARC-IIi. Based on these signals from RIC, UltraSPARC-IIi sets bits in the Reset_Control Register to allow software to identify the source of reset. If the RIC IC is not used, other logic should perform a similar power-up reset function.
17.2.5.1
Power Supply
After the system power supply is turned on and before its output stabilizes, it drives the POWER_OK signal inactive to put the system in a reset state. When the supply voltage reaches a level that can power a functional system within specifications, the power supply sets POWER_OK active. The RIC chip uses this signal to generate power-on-reset (POR) during the period POWER_OK is inactive to reset the system. It extends the reset period for 20K cycles at 7.159 MHz (approximately 2.8 ms) after the POWER_OK signal becomes active. The extra time is needed to allow the PLL circuitry on UltraSPARC-IIi to stabilize. The RIC chip asserts SYS_RESET_L to UltraSPARC-IIi during the whole reset period. After the deassertion of SYS_RESET_L, UltraSPARC-IIi keeps RST_L (the reset signal for peripheral logic) asserted for 1666668 processor clocks, which represents at least 5.5 ms at 300 MHz.
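The two durations quoted above follow directly from cycle count divided by clock frequency. The helper below is a sketch for checking such figures; its name is hypothetical, and "20K cycles" is assumed here to mean 20,000 cycles.

```c
#include <stdint.h>

/* Duration of `cycles` clock periods at `hz`, in nanoseconds, rounded down. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    return cycles * UINT64_C(1000000000) / hz;
}
```

Under that assumption, cycles_to_ns(20000, 7159000) gives roughly 2.79 ms for the RIC extension, and cycles_to_ns(1666668, 300000000) gives roughly 5.56 ms for the RST_L hold time, consistent with the "at least 5.5 ms" figure.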
17.2.5.2
Push-button Power On Reset
Two alternative external push-buttons allow user-triggered system resets: Pushbutton POR and Push-button XIR. Push-button POR has the same effect as a POR from the power supply. The only difference between these two resets is the resultant status bits in the UltraSPARC-IIi Reset_Control Register and the state of refresh (unchanged with Push-Button POR). The B_POR bit is set to indicate that the reset is caused by push-button POR.
17.2.5.3
Push-button XIR
Push-button XIR allows a user-reset of part of the processor without resetting the whole system. UltraSPARC-IIi sets the B_XIR bit in the Reset_Control Register when a Push-button XIR is detected. XIR affects the UltraSPARC core only without affecting the rest of the system, such as UltraSPARC-IIi IO, memory and I/O devices.
The effect of XIR on the UltraSPARC processor is different from that of POR; see Section 17.2.1, “Power-on Reset (POR) and Initialization”, Section 17.2.2, “Externally Initiated Reset (XIR)”, and TABLE 17-3.
Note – Do not assert Button POR and Button XIR while coming out of a system
reset (power on condition). This action activates a special test mode used for acquiring test patterns and this mode runs a shortened reset sequence.
17.2.6
17.2.6
Software Reset
17.2.6.1
Software Power On Reset
Software can also generate a POR-equivalent reset by setting the SOFT_POR bit in the UltraSPARC-IIi Reset_Control Register. This reset is different from the SIR supported in the UltraSPARC core.
Note – As with prior UltraSPARC-based systems, refresh is not disabled.
17.2.6.2
Soft XIR
Software can also issue XIR to the processor by setting the SOFT_XIR bit in the UltraSPARC-IIi Reset_Control Register. SOFT XIR has the same effect as other XIRs. Once set, the bit remains set until software clears it. This allows software to discover what caused a previous XIR.
17.2.6.3
Error Reset
None, so far.
17.2.6.4
Wake-up Reset
Compatibility Note – There is no Wakeup Reset support for power management,
unlike that in prior UltraSPARC-based systems.
UltraSPARC-IIi, in common with UltraSPARC, can enter power-down mode by executing a SHUTDOWN instruction but refresh is stopped in this condition. Providing a reset is the only way to leave power-down mode and resume normal operation but UltraSPARC-IIi does not automatically generate this reset.
17.2.7
Effects of Resets
The effects of Resets are visible to software. Reset operation also provides sequencing to ensure proper hardware operation. For example, all busses are tristated at power up.
17.2.7.1
Major Activities as a Function of Reset
TABLE 17-1 Effects of Resets
Reset Sources | Bit Set | Mem. Refresh (2) | Reset PCI Devices | Reset UPA64S | Effect on UltraSPARC-IIi CPU/PCI
POWER_OK | POR | Disable | Yes | Yes | POR
Push-button POR | B_POR | NC | Yes | Yes | POR
Push-button XIR (1) | B_XIR | NC | No | No | XIR
Soft POR | SOFT_POR | NC | Yes | Yes | POR
Soft XIR | SOFT_XIR | NC | No | No | XIR
1. Causes jump to XIR trap vector.
2. NC = No Change.
17.2.7.2
Bus Conditions at Power up
UPA64S Address Bus
This bus is always driven
UPA64S 64 bit Data Bus
This bus is shared by the UPA64S (graphics) interface and the memory transceiver ICs and it tristates on POR. The Fast Frame Buffer (FFB) ICs asynchronously tristate their data busses at reset.
Memory Data Bus
Driven by DRAM and the memory XCVR chips. The RAS* and CAS* signals driven by UltraSPARC-IIi are asynchronously deasserted. UltraSPARC-IIi causes the XCVR to tristate its data output pins during reset.
PCI
UltraSPARC-IIi IO asynchronously tristates this bus. It also asynchronously deasserts control signals.
17.2.7.3
Reset_Control Register (0x1FE.0000.F020)
The UltraSPARC-IIi Reset_Control indicates the source of a reset and provides control of software reset generation.
TABLE 17-2 Reset_Control Register
Field | Bits | Value | Description | Type
Reserved | 63:32 | 0 | Reserved | R0
POR | 31 | * (1) | Set if the last reset was due to the assertion of Sys_Reset_L | R/W1C
SOFT_POR | 30 | * | Setting to 1 causes a POR reset; stays set until software clears it | R/W
SOFT_XIR | 29 | * | Setting to 1 causes an XIR trap; stays set until software clears it | R/W
B_POR | 28 | * | Set if the last reset was due to the assertion of P_Reset_L | R/W1C
B_XIR | 27 | * | Set if the last reset was due to the assertion of an X_Reset_L | R/W1C
Reserved | 26:0 | 0 | Reserved | R0
1. The highest priority reset source has its bit set. Only the bits marked with “*” are set.
Only one of the reset bits is set. If multiple resets occur simultaneously, the following priority order is used:
1. POR
2. B_POR
3. SOFT_POR
4. B_XIR
5. SOFT_XIR
POR - Power On Reset
This bit is set if the last reset was due to the assertion of SYS_RESET_L pin and occurs whenever the machine power cycles.
SOFT_POR - Soft Power On Reset
Writing a 1 to this bit has the same effect as power-on reset, except that a different status bit in the Reset_Control Register is set. Memory refresh is not affected. Writing a 0 to this bit clears it and has no other effect.
SOFT_XIR - Soft Externally Initiated Reset
Writing a 1 to this bit causes the UltraSPARC-IIi to send a XIR trap to the UltraSPARC-IIi core. Writing a 0 to this bit clears it and has no other effect.
B_POR - Button Reset
This bit is set as a result of a “button” reset which is caused by an external switch and the subsequent assertion of the P_RESET_L pin. It can also be caused by scan in the RIC chip. Memory refresh is not affected. The actions and results of this reset are identical to that of Power-on Reset, except for a different status bit being set.
B_XIR - XIR Button Reset
This bit is set as a result of a “button” XIR Reset caused by an external switch asserting the X_RESET_L signal pin. This bit can also be set by scan in the RIC chip. The actions and results of this reset are identical to that of SOFT_XIR, except that a different status bit is set.
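Boot code that wants to know why it was reset can test the Reset_Control bits in the documented priority order. The sketch below takes the bit positions from the Reset_Control Register table (bits 31 through 27); the macro and function names are hypothetical, and reading the register at 0x1FE.0000.F020 is platform specific and not shown.

```c
#include <stdint.h>

/* Bit positions from the Reset_Control Register table. */
#define RC_POR       (UINT64_C(1) << 31)
#define RC_SOFT_POR  (UINT64_C(1) << 30)
#define RC_SOFT_XIR  (UINT64_C(1) << 29)
#define RC_B_POR     (UINT64_C(1) << 28)
#define RC_B_XIR     (UINT64_C(1) << 27)

/* Returns the mask of the highest-priority source bit that is set, using
 * the order POR, B_POR, SOFT_POR, B_XIR, SOFT_XIR, or 0 if none is set. */
static uint64_t reset_source(uint64_t reset_control)
{
    if (reset_control & RC_POR)      return RC_POR;
    if (reset_control & RC_B_POR)    return RC_B_POR;
    if (reset_control & RC_SOFT_POR) return RC_SOFT_POR;
    if (reset_control & RC_B_XIR)    return RC_B_XIR;
    if (reset_control & RC_SOFT_XIR) return RC_SOFT_XIR;
    return 0;
}
```

Checking in priority order is a defensive choice: SOFT_POR and SOFT_XIR stay set until software clears them, so a stale bit from an earlier reset may still be present.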
17.3
17.3
RED_state
17.3.1
Description of RED_state
RED_state is an acronym for Reset, Error, and Debug State. It serves two mutually exclusive purposes:
• Indication, during trap processing, that there are no more available trap levels; that is, if another nested trap is taken, the processor will enter error_state and halt. RED_state provides system software with a restricted execution environment.
• Provision of an execution environment for all reset processing
This state is entered under any of the occurrences:
• Trap taken when TL = MAXTL - 1
• Reset requests: POR, XIR, WDR
• Reset request: SIR if TL
Y PIL CWP TT[TL] CCR ASI TL TPC[TL] TNPC[TL]
Unknown Unknown
TABLE 17-3 Name
Machine State After Reset and in RED_state (Continued)
POR WDR XIR SIR RED_state‡
Fields
TSTATE
CCR ASI PSTATE CWP PC nPC NPT counter
Unknown Unknown Unknown Unknown Unknown Unknown 1 Restart at 0 Unknown Unknown Unknown Unknown Unchanged count
CCR ASI PSTATE CWP PC nPC Unchanged Restart at 0 Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged 001716 UltraSPARC-I=001016 UltraSPARC-II=001116 mask-dependent 5 7 Unchanged count
TICK CANSAVE CANRESTORE OTHERWIN CLEANWIN WSTATE
OTHER NORMAL MANUF IMPL MASK MAXTL MAXWIN all all
Unknown Unknown
VER
FSR FPRS
0 Unknown Non-SPARC-V9 ASRs
Unchanged Unchanged
SOFTINT TICK_COMPARE INT_DIS TICK_CMPR S1 S0 UT (trace user) ST (trace system) PRIV (priv access)
Unknown 1 (off) Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged
PERF_CONTROL
PERF_COUNTER GSR
TABLE 17-3 Name
Machine State After Reset and in RED_state (Continued)
POR WDR Non-SPARC-V9 ASIs XIR SIR RED_state‡
Fields
UPA_PORT_ID *
FC ECC_VALID ONEREAD PINT_RDQ PREQ_DQ PREQ_RQ UPACAP ID ELIM MID all 0 0 0 (off) 0 Unknown Unknown ASI FT E CTXT PRIV W OW (overwrite) FV (SFSR valid) Unknown Unknown Unknown Unknown Unknown Unknown Unknown 0 Unknown UE CE E_SYNDR FMODE FCBV NACK BUSY BUSY MID ISAPEN (sys addr err) NCEEN (non CE) CEEN (CE) PA all Unknown Unknown Unknown Unknown Unknown Unknown 0 0 Unknown 0 (off) 0 (off) 0 (off) Unknown Unchanged
FC16 0 1 1 0 1 1B16 TBD Unchanged 0 0 (off) Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged Unchanged
UPA_CONFIG LSU_CONTROL
DISPATCH CONTROL VA_WATCHPOINT PA_WATCHPOINT
I-& D-MMU_SFSR,
D-MMU_SFAR UDBH_ERR, UDBL_ERR UDBH_CONTROL, UDBL_CONTROL INTR_DISPATCH INTR_RECEIVE
ESTATE_ERR_EN
AFAR AFSR
TABLE 17-3 Name
Machine State After Reset and in RED_state (Continued)
POR WDR XIR SIR RED_state‡
Fields
Other UltraSPARC-IIi Specific States
Processor and E-cache tags and data Cache snooping Instruction Buffers Load/Store Buffers, all outstanding accesses Mappings E-bit (sideeffect) NC-bit (noncacheable) all
Unknown
Unchanged Enabled Empty
Empty
Unchanged
Empty
iTLB, dTLB
Unknown 1 1 RSTV | 2016
Unchanged 1 1 Unchanged
RAS
‡
Processor states are updated according to this table only when RED_state is entered on a reset or trap. If software explicitly sets PSTATE.RED to 1, it must create the appropriate states itself.
CHAPTER
18
MCU Control and Status Registers
Note – Registers which are designated as Write Only may be read, but the data returned is UNDEFINED. Software should not rely on the value returned. Writes to Read Only registers have no effect. No error is reported for either case.
Compatibility Note – Prior UltraSPARC Systems used other means for controlling these functions.
Register accesses here are all 8 bytes. Reads of any size up to 8 bytes to any register are supported regardless of whether reads of that size make sense. Writes of any size up to 8 bytes are also supported regardless of whether writes of that size make sense. Writes of any size MAY corrupt unwritten bits in the register (i.e., writes may result in all 8 bytes being written regardless of the indicated write size). Software must ensure that only the proper sized (i.e. equal to the register size) accesses are used. No hardware checking is performed. Block (64 byte) access will erroneously cause a UPA64S or PCI transaction with an undefined address. Misaligned access due to not setting the “E” bit correctly in the TTE also yields unpredictable results.
TABLE 18-1 MCU CSRs
PA | Register Name | Associated Port
1FE.0000.F000 | FFB_Config | FFB
1FE.0000.F010 | Mem_Control0 | Memory Control Unit
1FE.0000.F018 | Mem_Control1 | Memory Control Unit
The Mem_Control registers are reset to their initial values only during Power-on Reset (POR). This is so that refresh can operate properly during and after other resets.
18.1
FFB_Config Register (0x1FE.0000.F000)
TABLE 18-2 FFB_Config Register
Field | Bits | Description | Reset | Type
Reserved | 63:28 | Reserved. | 0 | R0
SPRQS | 27:24 | Slave P_request queue size. Initialize to max size in 2 Cycle Packets of the corresponding slave request queue. | 1 | RW
Reserved | 23:15 | Reserved. | 0 | R0
Oneread | 14 | Always oneread. UPA slave interface will not support multiple outstanding reads. | 1 | R1
Reserved | 13:0 | Reserved | 0 | R0
The Data Queue Size is not tracked separately, and the UPA64S device must be able to receive 64 bytes per allowed outstanding request.
18.2
Mem_Control0 Register (0x1FE.0000.F010)
TABLE 18-3 Mem_Control0 Register
Field | Bits | Description | POR | Type
Reserved | 63:32 | Reserved | 0 | R0
RefEnable | 31 | Refresh enable | 0 | RW
Reserved | 30:29 | Reserved | 0 | R0
ECCEnable | 28 | Enable all ECC functions | 0 | RW
Reserved | 27 | Reserved (note RW) | 0 | RW
Reserved | 26:13 | Reserved | 0 | R0
11-bit Column Address | 12 | Enables 11-bit column address mode. | 0 | RW
DIMMPairPresent | 11:8 | Determines which DIMM pairs to refresh. | 0xF | RW
RefInterval | 7:0 | Interval between refreshes. Each encoding is 32 processor clocks | 0x30 | RW
ECCEnable
This bit enables the MCU to perform single-bit error detection and correction, and to notify the ECU and PBM of single-bit or multi-bit errors for possible logging and trap/interrupt generation. In general this bit should always be set to 1, unless DIMMs that do not support check bits are used. There are further enables for ECC-related trap and interrupt generation in the ECU and PBM. See Section 16.6.1, “E-cache Error Enable Register” on page 250, the DMA UE/CE interrupt mapping registers in “Partial Interrupt Mapping Registers” on page 316, and ERRINT_EN in “PCI Control/Status Register” on page 294.
RefEnable
Main memory is composed of dynamic RAMs, which require periodic “refreshing” to maintain the contents of the memory cells. RefEnable == 1 is used to enable refresh of main memory. RefEnable == 0 disables refresh.
POR is the only reset condition that clears RefEnable (and initializes the rest of the Mem_Control0/1). SOFT_POR, B_POR, B_XIR, and SOFT_XIR leave RefEnable unchanged and refresh continues normally. Any refresh operation in progress is aborted at the time of clearing this bit. The truncated memory signals in this case could lead to loss of data.
11-bit Column Address
The default memory addressing only supports 10-bit column address DRAMs. An additional mode was added to support an 11-bit column address. Since the total number of address bits available in the memory controller is constant (1 Gbyte max. addressable), the maximum number of DIMM pairs in this mode is cut in half. See “11-bit Column Addressing” on page 65.
DIMMPairPresent
Indicates the presence or absence of DIMMs, so that the performance degradation caused by refreshing unpopulated DIMMs can be eliminated. A 0 indicates not present; a 1 indicates present. Set by software after probing. Note that in 11-bit Column Address mode, only DIMM Pairs 0 and 2 can be marked present. Pairs 1 and 3 should always be marked not present.
TABLE 18-4 DIMMPairPresent Encoding
DIMMPairPresent | DIMM Pair
0 | 0
1 | 1
2 | 2
3 | 3
Note – Refresh must be disabled first, by clearing the RefEnable bit, before changing the DIMMPairPresent field or the RefInterval field. Refresh may be enabled again simultaneously with writing DIMMPairPresent and RefInterval. Failure to follow this rule may result in unpredictable behavior.
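The ordering rule in the note above can be sketched as a two-write sequence. This is an illustrative model only: the bit positions are taken from TABLE 18-3, while the function and constant names are hypothetical, and the actual register write (an 8-byte store to 0x1FE.0000.F010) is left to the caller.

```python
# Sketch only: bit positions come from TABLE 18-3; all names are illustrative.
REF_ENABLE = 1 << 31                  # RefEnable, bit 31
DIMM_PRESENT_SHIFT = 8                # DIMMPairPresent, bits 11:8
DIMM_PRESENT_MASK = 0xF << DIMM_PRESENT_SHIFT
REF_INTERVAL_MASK = 0xFF              # RefInterval, bits 7:0


def refresh_update_sequence(current, dimm_pair_present, ref_interval):
    """Return the Mem_Control0 values to write, in order."""
    # Write 1: clear RefEnable while leaving every other field untouched.
    disabled = current & ~REF_ENABLE
    # Write 2: install the new fields and set RefEnable in the same write.
    cleared = disabled & ~(DIMM_PRESENT_MASK | REF_INTERVAL_MASK)
    enabled = (cleared
               | ((dimm_pair_present & 0xF) << DIMM_PRESENT_SHIFT)
               | (ref_interval & 0xFF)
               | REF_ENABLE)
    return [disabled, enabled]
```

A caller would issue the first value, allow any refresh in progress to finish, and then issue the second value.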
TABLE 18-5 Various Memory Configurations
DIMM size | Base device | # of devices | System memory min/max config
8 MB | 1M x 4 | 18 | 16 MB/64 MB
16 MB | 2M x 8 | 9 | 32 MB/128 MB
32 MB | 4M x 4 | 18 | 64 MB/256 MB
64 MB | 4M x 4 (banked) | 36 | 128 MB/512 MB
64 MB | 8M x 8 | 9 | 128 MB/512 MB
128 MB | 8M x 8 (banked) | 18 | 256 MB/1 GB
128 MB | 16M x 4 | 18 | 256 MB/1 GB
256 MB | 16M x 4 (banked) | 36 | 512 MB/1 GB
RefInterval
RefInterval specifies the interval time between refreshes, in quanta of 32 CPU clocks. SW should program RefInterval according to TABLE 18-6. Values given are in hexadecimal and derived from this formula:
$$\text{refValue} = \frac{\text{refreshPeriod}}{\text{numberOfRows} \times \text{ClockPeriod} \times 32 \times \text{numberOfPairs}}$$
TABLE 18-6 Refresh Period (in 32 × CPU clock periods) as a Function of Frequency
DIMM pairs | 330-301 MHz | 300-271 MHz | 270-251 MHz | 250-225 MHz | 224-201 MHz | 200-167 MHz | 166-125 MHz
1 | 0xA1 | 0x92 | 0x83 | 0x7A | not allowed by CPU PLL | 0x61 | 0x51
2 | 0x50 | 0x49 | 0x41 | 0x3D | not allowed by CPU PLL | 0x30 | 0x28
3 | 0x35 | 0x30 | 0x2B | 0x28 | not allowed by CPU PLL | 0x20 | 0x1B
4 | 0x28 | 0x24 | 0x20 | 0x1E | not allowed by CPU PLL | 0x18 | 0x14
That is, with the frequency expressed in MHz: (32 * frequency * 1000) / (2048 * 32 * DIMM pairs).
This data is based on using 16 MB (2048 rows/32 ms) EDO DRAMs only; this configuration matches the composite DIMM specification.
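As a worked check of the RefInterval arithmetic, the following sketch evaluates the expression above with integer truncation. The function name and default arguments are illustrative assumptions. The tabulated entries appear to be consistent with using the highest frequency of each range and discarding the fractional part, which is what the sketch does.

```python
def ref_interval(freq_mhz, dimm_pairs, refresh_ms=32, rows=2048):
    """RefInterval in units of 32 CPU clocks, truncated to an integer.

    freq_mhz   -- CPU frequency in MHz (top of the frequency range)
    dimm_pairs -- number of populated DIMM pairs (1 to 4)
    refresh_ms -- DRAM refresh period in milliseconds (32 ms assumed)
    rows       -- number of DRAM rows to refresh (2048 assumed)
    """
    numerator = refresh_ms * freq_mhz * 1000   # CPU clocks per refresh period
    denominator = rows * 32 * dimm_pairs       # 32-clock quanta, all pairs
    return numerator // denominator


# One DIMM pair at 330 MHz: 10,560,000 / 65,536 = 161.1, truncated to 0xA1.
print(hex(ref_interval(330, 1)))
```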
18.3
Mem_Control1 Register (0x1FE.0000.F018)
Memory Control Register 1 contains fields that control the read, write, and refresh timing for the DRAM DIMMs. They allow software to optimize the memory access timing for a particular system frequency. The contents of Memory Control Register 1 can be changed as required by an electrical tuning of memory timing based on detailed SPICE analysis. Please see TABLE 18-17 for the proper programming values for this register.
Note – Only 60 ns (or faster) DRAMs are supported. See your SME representative
for the exact composite DRAM specification.
TABLE 18-7 Mem_Control1 Register
Field | Bits | POR State | Description | Type
Reserved | 63:30 | 0 | Reserved. Read as zero | R0
AMDC | 29:27 | 0 | Advance Memdata Clock | R/W
ARDC | 26:24 | 0 | Advance DRAM Read Data Clock | R/W
CSR | 23:21 | 2 | CAS* to RAS* delay for CBR refresh cycles. | R/W
CASRW (1) | 20:18 | 2 | CAS* length for read/write | R/W
RCD | 17:15 | 4 | RAS to CAS Delay | R/W
CP | 14:12 | 2 | CAS Precharge | R/W
RP | 11:9 | 4 | RAS Precharge | R/W
RAS | 8:6 | 5 | Length of RAS for Refresh | R/W
CASRW | 5:3 | 2 | Must be same as 20:18 | R/W
RSC | 2:0 | 0 | RAS after CAS hold time | R/W
1. The register originally had separate fields for CAS during reads and CAS during writes. However, memory timing is optimal if writes and reads use the same CAS width. Additionally, an erratum caused the read CAS width to be used in one part of the write control logic. Both fields are now given the same name, and must be programmed to the same value. Results are undefined if they are different.
Power-on reset values are indeterminate; the boot PROM should always reprogram these according to the CPU frequency table.
AMDC- Advance Memdata Clock
This field adjusts the relative timing between a transceiver clock transition and the point at which the processor latches read data driven by that transceiver (using the MEMDATA bus). This timing adjustment allows for earlier data clocking (advance) for slower clock cycles, or for later data clocking for fast clock cycles. Delaying this clocking by a cycle (relative to the recommended values) may be useful if timing is critical, but it reduces hold time margin.
TABLE 18-8 AMDC Arguments and Timing
Argument | Timing
100 | Advance Memdata clocking by 4 processor clocks (-4).
101 | Advance Memdata clocking by 3 processor clocks.
110 | Advance Memdata clocking by 2 processor clocks.
111 | Advance Memdata clocking by 1 processor clock.
000 | Default Memdata clocking
001 | Delay Memdata clocking by 1 processor clock.
010 | Delay Memdata clocking by 2 processor clocks.
011 | Delay Memdata clocking by 3 processor clocks.
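The AMDC encodings in TABLE 18-8 read naturally as a 3-bit two's complement number, with negative values meaning "advance" and positive values meaning "delay". The sketch below decodes an argument on that reading; the function name is hypothetical and the two's complement interpretation is an observation about the table, not a statement from the hardware specification.

```python
def clock_adjust(code):
    """Decode a 3-bit AMDC argument into a signed clock offset.

    Negative results advance the clocking; positive results delay it.
    """
    code &= 0b111                      # keep only the 3-bit field
    return code - 8 if code & 0b100 else code


for code in (0b100, 0b111, 0b000, 0b011):
    print(format(code, "03b"), clock_adjust(code))
```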
ARDC- Advance Read Data Clock
Maintaining a minimum EDO DRAM CAS cycle is difficult if the DIMM loading is widely variable. Light loading on the CAS and DATA lines can make the data disappear before it is clocked and produce a hold time problem. The motherboard reference design specifies buffering to make the RAS/CAS/WE delays independent of the number of DIMMs in circuit. However, the ADDR and DATA delays do vary with DIMM population. If necessary, this field can be used to advance the clock that latches read data in the transceivers. This may be necessary when only one or two DIMM pairs are populated. It can also be used to delay the clock for heavily loaded DIMM populations.
Current simulations indicate that the ARDC value need not be varied for the supported range and combinations of DIMM configurations.
TABLE 18-9 ARDC Timing Arguments
Argument | Timing
100 | Advance DRAM Read data clocking by 4 processor clocks (-4).
101 | Advance DRAM Read data clocking by 3 processor clocks.
110 | Advance DRAM Read data clocking by 2 processor clocks.
111 | Advance DRAM Read data clocking by 1 processor clock.
000 | Default DRAM Read data clocking based on CAS assertion time
001 | Delay DRAM Read data clocking by 1 processor clock.
010 | Delay DRAM Read data clocking by 2 processor clocks.
011 | Delay DRAM Read data clocking by 3 processor clocks.
CSR - CAS before RAS delay timing
This field controls the CAS* assertion to RAS* assertion delay for CAS* before RAS* (CBR) refresh cycles.
TABLE 18-10 CSR Delay Timing
Argument | Timing
000 | 3 CPU clocks between CAS* and RAS*
001 | 4 CPU clocks between CAS* and RAS*
010 | 5 CPU clocks between CAS* and RAS*
011 | 6 CPU clocks between CAS* and RAS*
100 | 7 CPU clocks between CAS* and RAS*
101 | 8 CPU clocks between CAS* and RAS*
110-111 | Reserved
CASRW- CAS assertion for read/write cycles
CASRW controls the minimum CAS* assertion time for reads and writes.
TABLE 18-11 CASRW Assertion Time
Argument | Timing
000 | CAS* low for 3 CPU clocks
001 | CAS* low for 4 CPU clocks
010 | CAS* low for 5 CPU clocks
011-111 | Reserved
RCD - RAS to CAS Delay
RCD controls the RAS to CAS delay during the initial part of the read or write memory cycle.
TABLE 18-12 RCD Delay
Argument | Timing
000 | 6 CPU clocks between the assertion of RAS* and the assertion of CAS*
001 | 7 CPU clocks between the assertion of RAS* and the assertion of CAS*
010 | 8 CPU clocks between the assertion of RAS* and the assertion of CAS*
011 | 11 CPU clocks between the assertion of RAS* and the assertion of CAS*
100 | 12 CPU clocks between the assertion of RAS* and the assertion of CAS*
101 | 14 CPU clocks between the assertion of RAS* and the assertion of CAS*
110 | 15 CPU clocks between the assertion of RAS* and the assertion of CAS*
111 | Reserved
CP - CAS Precharge
CP controls the CAS precharge time in between page cycles.
TABLE 18-13 CP – CAS Precharge Time
Argument | Timing
000 | 3 CPU clocks of CAS Precharge
001 | 4 CPU clocks of CAS Precharge
010 | 5 CPU clocks of CAS Precharge
011-111 | Reserved
RP - Ras Precharge
RP controls the RAS precharge time between memory cycles.
TABLE 18-14 RP Timing
Argument | Timing
000 | 8 CPU clocks of RAS precharge
001 | 9 CPU clocks of RAS precharge
010 | 10 CPU clocks of RAS precharge
011 | 11 CPU clocks of RAS precharge
100 | 12 CPU clocks of RAS precharge
101 | 14 CPU clocks of RAS precharge
110 | 15 CPU clocks of RAS precharge
111 | Reserved
RAS
RAS is used to control the length of time that RAS is asserted during refresh cycles.
TABLE 18-15 RAS Duration Time
Argument | Timing
000 | 13 CPU clocks of RAS* assertion
001 | 15 CPU clocks of RAS* assertion
010 | 18 CPU clocks of RAS* assertion
011 | 22 CPU clocks of RAS* assertion
100 | 23 CPU clocks of RAS* assertion
101 | 24 CPU clocks of RAS* assertion
110-111 | Reserved
RSC-RAS after CAS delay timing
RSC controls time to deassert RAS* after CAS* at the end of a memory cycle.
TABLE 18-16 RSC – RAS Deassert Time
Argument | Timing
000 | RAS* Assertion after CAS* for 4 CPU clocks
001 | RAS* Assertion after CAS* for 5 CPU clocks
010 | RAS* Assertion after CAS* for 6 CPU clocks
011 | RAS* Assertion after CAS* for 7 CPU clocks
100 | RAS* Assertion after CAS* for 8 CPU clocks
101 | RAS* Assertion after CAS* for 9 CPU clocks
110-111 | Reserved
18.4
Programming Mem_Control1
TABLE 18-17 gives program values to support one, two, three, or four DIMM pairs, with one or two banks of DRAM on each DIMM. These values are given as a function of the internal CPU operating frequency.
These tabulated values depend upon the conditions:
• The motherboard meeting the min/max delay specifications for RAS/CAS/MEMADDR/DATA/MEMDATA, and all transceiver control and clock signals;
• The design specifications for max skew between RAS/CAS/MEMADDR/DATA being met.
• The specified DIMMs being used. (buffered CAS/WE/ADDR)
Memory Control Register programming may also be used to support memory subsystems whose performance lies outside the suggested design specifications. Because not all skew and hold time relationships for the DRAMs are programmable, it is recommended that all designs meet the etch length specifications and employ DIMMs that meet the composite specification. It is possible that alternate values may give higher performance from 50 ns DRAM. The minimum CAS cycle with this programming is 26.5 ns (13.25 ns CAS assertion) at 300 MHz.
TABLE 18-17 Mem_Control1 values as a function of CPU frequency
CPU (MHz) | AMDC | ARDC | CSR | CASRW | RCD | CP | RP | RAS | RSC | Mem_Control1 [31:0]
330-301 | 1 | 4 | 2 | 2 | 5 | 2 | 5 | 4 | 4 | 0x0C4AAB14
300-271 | 0 | 6 | 2 | 1 | 3 | 1 | 5 | 3 | 3 | 0x06459ACB
270-251 | 0 | 6 | 1 | 1 | 4 | 1 | 3 | 2 | 2 | 0x0626168A
250-225 | 0 | 6 | 1 | 1 | 4 | 1 | 3 | 2 | 2 | 0x0626168A
224-201 | Frequency range is not supported by the CPU PLL
200-167 | 7 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0x38008241
166-125 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0x38008000
0-1 (1) | 7 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0x3D000000
1. This programming is included for emulation. The PLLs should be bypassed, and an external means of supplying DRAM refresh should be provided.
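The 32-bit values in TABLE 18-17 can be reproduced by packing the per-field arguments at the bit positions listed in TABLE 18-7. The sketch below does this; the function and dictionary names are illustrative, and the single casrw argument is written to both CAS width fields (bits 20:18 and 5:3), which must be programmed identically.

```python
# Field name -> least significant bit position, from TABLE 18-7.
FIELD_SHIFTS = {
    "amdc": 27, "ardc": 24, "csr": 21, "casrw_hi": 18, "rcd": 15,
    "cp": 12, "rp": 9, "ras": 6, "casrw_lo": 3, "rsc": 0,
}


def pack_mem_control1(amdc, ardc, csr, casrw, rcd, cp, rp, ras, rsc):
    """Pack 3-bit timing arguments into the Mem_Control1 [31:0] value."""
    values = {
        "amdc": amdc, "ardc": ardc, "csr": csr,
        "casrw_hi": casrw, "casrw_lo": casrw,   # both copies must match
        "rcd": rcd, "cp": cp, "rp": rp, "ras": ras, "rsc": rsc,
    }
    word = 0
    for name, shift in FIELD_SHIFTS.items():
        value = values[name]
        if not 0 <= value <= 7:
            raise ValueError(f"{name} must fit in 3 bits, got {value}")
        word |= value << shift
    return word


# 330-301 MHz row of TABLE 18-17.
print(hex(pack_mem_control1(1, 4, 2, 2, 5, 2, 5, 4, 4)))
```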
Initialization of the Mem_Control registers should be performed in accordance with the probing algorithm described in Section A.10.2, “Memory Probing” on page 397.
Note – The Mem_Control register must be initialized before any memory operation,
including refresh. Before modifying the register, software must complete and inhibit all memory references and disable refresh. Wait 100 clock periods after disabling refresh to guarantee completion of any refresh in progress.
18.5
UPA Configuration Register
The UPA_CONFIG register can be accessed at ASI 0x4A, VA==0. This is a 64-bit register; non-64-bit aligned accesses cause a mem_address_not_aligned trap. Much of the UltraSPARC-I and UltraSPARC-II functionality in this register is removed. UltraSPARC-IIi uses a register in the Memory Control Unit to restrict the number of outstanding UPA64S slave requests, instead of this register. The new ELIM field is copied from UltraSPARC-II.
FIGURE 18-1 UPA_CONFIG Register Format (fields, most significant first: reserved, ELIM, PCON, MID, PCAP; bit labels as printed: 63, 39, 38, 37, 36, 35, 33, 32, 22, 21, 17, 16, 0)
ELIM: This field can be used to zero upper bits of the E-cache tag address, if more address pins are used on the tag RAM than necessary. It can also be used to force the use of a smaller E-cache size than is supplied with the UltraSPARC-IIi system. Resets to 000. Must be set to a size not bigger than the E-cache data RAMs provide, otherwise incorrect E-cache operation will result.
000 has no effect on the E-cache tag address.
111 and 110 zero the 3 MSBs to create a 256-kbyte E-cache, regardless of the SRAM size or connections to the E-tag.
101 allows a 512-kbyte E-cache, if the SRAMs used are sized appropriately. Otherwise, the E-cache is the size allowed by the SRAMs.
100 allows a 1-Mbyte E-cache.
011 allows a 2-Mbyte E-cache, the largest supported by UltraSPARC-IIi.
Behavior for other encodings is Reserved.
PCON[7:0]: Unused on UltraSPARC-IIi; Read as 0
MID[4:0]: Module (processor) ID register; Read as 0
PCAP[16:0]: Read as 0 on UltraSPARC-IIi
CHAPTER
19
UltraSPARC-IIi PCI Control and Status
19.1
Terms and Abbreviations Used
R - Read only
R0 - Read zero always
W - Write only
R/W - Read / Write
R/W1C - Read / Write with 1 to clear
In this section, unless otherwise noted, all references to UltraSPARC-IIi and its registers refer to UltraSPARC-IIi’s functional IO, as opposed to the UltraSPARC-IIi core. The term UltraSPARC-IIi IO is sometimes used to emphasize this point.
Caution – Registers that are designated write only may be read, but the data
returned is undefined, and no error is reported for the access. Software should never rely on the value returned. Writes to read only registers also have no effect, with no error reported.
19.2
Access Restrictions
Register accesses to UltraSPARC-IIi IO can be in any size from one byte to 8 bytes. Sizes and locations for the registers are given in the following sections. Reads of any size up to 8 bytes to any register are supported regardless of whether reads of that size make sense. Writes of any size up to 8 bytes are also supported regardless of whether writes of that size make sense. Writes of any size may corrupt unwritten bits in the register (that is, writes may result in all 8 bytes being written regardless of the indicated write size). Software must ensure that only the proper sized accesses are used. No hardware checking is performed. Block (64 byte) access to UltraSPARC-IIi IO registers causes a PCI or UPA64S transaction to an unspecified address. Misaligned access due to not correctly setting the “E” bit in the TTE also yields unpredictable results.
19.3
PCI Bus Module Registers
These registers control aspects of UltraSPARC-IIi’s PCI operations that are not defined by the PCI specification. The registers defined by the PCI specification are listed in TABLE 19-12.
TABLE 19-1 PBM Registers
Register | PA | Access Size
PCI Control/Status Register | 0x1FE.0000.2000 | 8 bytes
PCI PIO Write AFSR | 0x1FE.0000.2010 | 8 bytes
PCI PIO Write AFAR | 0x1FE.0000.2018 | 8 bytes
PCI Diagnostic Register | 0x1FE.0000.2020 | 8 bytes
PCI Target Address Space Register | 0x1FE.0000.2028 | 8 bytes
PCI DMA Write Synchronization Register | 0x1FE.0000.1C20 | 8 bytes
PIO Data Buffer Diagnostics Access | 0x1FE.0000.5000 - 0x1FE.0000.5038 | 8 bytes
DMA Data Buffer Diagnostics Access | 0x1FE.0000.5100 - 0x1FE.0000.5138 | 8 bytes
DMA Data Buffer Diagnostics Access (72:64) | 0x1FE.0000.51C0 | 8 bytes
Compatibility Note – APB has a similar additional state for each of its PCI busses.
See the APB User’s Manual for details.
Note – The bit definitions that follow assume “big-endian” type accesses.
19.3.0.1
PCI Control/Status Register
TABLE 19-2 PCI Control and Status Register
Field | Bits | Description | POR state | RW
Reserved | 63:37 | Reserved, read as 0 | 0 | R0
PCI_MRLM_EN | 36 | 1 = enable the generation of PCI Memory Read Line for Block loads, and Memory Read Multiple for 8 byte loads and noncacheable instruction fetch. 0 = force use of PCI Memory Read for all PIO reads. 1 provides a performance benefit due to APB prefetch capability for these commands. | 0 | RW
Reserved | 35 | Read as 0 | 0 | R0
PCI_SERR | 34 | Set when SERR# signal is asserted on the PCI bus | 0 | R/W1C
Reserved | 33:22 | Reserved, read as 0 | 0 | R0
ARB_PARK | 21 | PCI bus arbitration parking enable. 0 = UltraSPARC-IIi parks when idle. 1 = previous bus owner parked (including UltraSPARC-IIi). | 0 | RW
CPU_PRIO (1) | 20 | UltraSPARC-IIi arbitration priority. 0 = no extra priority for CPU. 1 = CPU will be granted every other bus cycle if requested. | 0 | RW
ARB_PRIO (1) | 19:16 | Slot arbitration priority (1 bit per slot). 0 = no extra priority. 1 = slot will be granted every other bus cycle if requested. | 0 | RW
Reserved | 15:9 | Reserved, read as 0. | 0 | R0
ERRINT_EN | 8 | Enable PCI error interrupt. 0 = PCI error interrupt disabled. 1 = PCI error interrupt enabled. | 0 | RW
RETRY_WAIT_EN | 7 | Two flow control mechanisms exist for DMA. 1 = Retry if a prior DMA write is still completing. 0 = Wait if possible (some cases still retry because of unavailability of address registers). Because of the inability to provide fairness with the retry protocol, overall system performance is generally better with 0. | 0 | RW
Reserved | 6:4 | Reserved, read as 0. | 0 | R0
ARB_EN | 3:0 | PCI arbitration enable. One independent bit for each supported device on the bus. 0 = Bus requests from corresponding PCI device are ignored. 1 = Bus requests from corresponding PCI device are honored. | 0 | RW
1. Software must ensure that at most one bit of {CPU_PRIO, ARB_PRIO[3:0]} is set to 1. The result of setting multiple bits is undefined and can potentially result in some PCI devices being unfairly starved.
Recommended value is 0x10.0020.0101 for systems using APB:
• PCI_MRLM_EN=1
• PCI_SERR=0
• ARB_PARK=1
• CPU_PRIO=0
• ARB_PRIO=0
• ERRINT_EN=1
• RETRY_WAIT_EN=0
• ARB_EN=1
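To show how the recommended value is assembled from the individual fields, the sketch below packs them at the bit positions given in TABLE 19-2. The function and constant names are illustrative assumptions. It also rejects settings that violate the footnote restricting {CPU_PRIO, ARB_PRIO[3:0]} to at most one set bit.

```python
# Bit positions from TABLE 19-2; names are illustrative.
PCI_MRLM_EN_BIT = 36
ARB_PARK_BIT = 21
CPU_PRIO_BIT = 20
ARB_PRIO_SHIFT = 16       # 4-bit field, bits 19:16
ERRINT_EN_BIT = 8
RETRY_WAIT_EN_BIT = 7


def pci_control(mrlm_en, arb_park, cpu_prio, arb_prio,
                errint_en, retry_wait_en, arb_en):
    """Build a PCI Control/Status Register value from its writable fields."""
    priority_bits = ((cpu_prio & 1) << 4) | (arb_prio & 0xF)
    if bin(priority_bits).count("1") > 1:
        raise ValueError("at most one of CPU_PRIO/ARB_PRIO[3:0] may be set")
    return ((mrlm_en & 1) << PCI_MRLM_EN_BIT
            | (arb_park & 1) << ARB_PARK_BIT
            | (cpu_prio & 1) << CPU_PRIO_BIT
            | (arb_prio & 0xF) << ARB_PRIO_SHIFT
            | (errint_en & 1) << ERRINT_EN_BIT
            | (retry_wait_en & 1) << RETRY_WAIT_EN_BIT
            | (arb_en & 0xF))


# Field settings listed above for systems using APB.
print(hex(pci_control(1, 1, 0, 0, 1, 0, 1)))
```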
19.3.0.2
PCI PIO Write Asynchronous Fault Status/Address Registers
The PCI PIO Write AFSR/AFARs record error information related to PIO writes to PCI slave devices. Only asynchronous errors reported through interrupts are recorded in these registers. Asynchronous errors include any PIO write access terminated by Master Abort, Target Abort, or excessive retries, as well as any PIO write during which a parity error was signaled on the PCI bus. Although status bits for Master Abort, Target Abort and Parity Error exist in the PCI Configuration Registers for each PBM, they are duplicated in these registers to allow software to identify the chronological order of multiple errors and to associate an address with each one.
This register contains primary error status bits and secondary error status bits. Only one of the primary error status bits can be set at any time. Primary error status can be set only when
• None of the primary error conditions exists prior to this error, or
• A new error is detected at the same time as software is clearing the primary error; “at the same time” means on coincident clock cycles. Setting takes precedence over clearing.
Secondary bits are set whenever a primary bit is set. The secondary bits are cumulative and always indicate that information has been lost because no address information has been captured. Setting of the primary error bits is independent. The AFAR and bits of AFSR log the address and status of the primary PCI PIO error. A new PCI PIO error is not logged into these bits until software clears the primary error to make the AFAR and part of the AFSR available for logging the new error.
TABLE 19-3 PCI PIO Write AFSR
Field | Bits | Description | POR state | RW
P_MA | 63 | Set if primary error detected is Master Abort | 0 | R/W1C
P_TA | 62 | Set if primary error detected is Target Abort | 0 | R/W1C
P_RTRY | 61 | Set if primary error detected is excessive retries | 0 | R/W1C
P_PERR | 60 | Set if primary error detected is parity error | 0 | R/W1C
S_MA | 59 | Set if secondary error detected is Master Abort | 0 | R/W1C
S_TA | 58 | Set if secondary error detected is Target Abort | 0 | R/W1C
S_RTRY | 57 | Set if secondary error detected is excessive retries | 0 | R/W1C
S_PERR | 56 | Set if secondary error detected is parity error | 0 | R/W1C
Reserved | 55:48 | Reserved, read as 0 | 0 | R0
BYTEMASK | 47:32 | 47:40 are always 0. 39:32 identify the bytes stored, modulo 8 bytes. Bit 32 is byte 0. | 0 | R
BLK | 31 | Set to 1 if failed primary transfer was a block write | 0 | R
Reserved | 30:0 | Reserved, read as 0 | 0 | R0
An interrupt is generated whenever
• a primary error is logged, and
• the PBM Error Interrupt is enabled by its mapping register, and
• ERRINT_EN is set in the PCI Control/Status Register
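The AFSR layout and the three interrupt conditions above can be modeled as follows. This is a minimal sketch: the bit positions follow TABLE 19-3, while the function names and the returned dictionary layout are hypothetical conveniences for illustration.

```python
# Bit positions from TABLE 19-3; the names of the helpers are illustrative.
PRIMARY_BITS = {63: "P_MA", 62: "P_TA", 61: "P_RTRY", 60: "P_PERR"}
SECONDARY_BITS = {59: "S_MA", 58: "S_TA", 57: "S_RTRY", 56: "S_PERR"}


def decode_afsr(afsr):
    """Split a PCI PIO Write AFSR value into its documented fields."""
    return {
        "primary": [n for b, n in PRIMARY_BITS.items() if (afsr >> b) & 1],
        "secondary": [n for b, n in SECONDARY_BITS.items() if (afsr >> b) & 1],
        "bytemask": (afsr >> 32) & 0xFF,   # bits 39:32; bit 32 is byte 0
        "blk": bool((afsr >> 31) & 1),     # failed primary was a block write
    }


def error_interrupt(afsr, mapping_enabled, errint_en):
    """True when all three interrupt conditions listed above hold."""
    primary_logged = bool(decode_afsr(afsr)["primary"])
    return primary_logged and bool(mapping_enabled) and bool(errint_en)


sample = (1 << 62) | (1 << 57) | (0x0F << 32) | (1 << 31)
print(decode_afsr(sample))
```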
Note – The logged PA may point to the error PA + 4, if the PIO write is more than 4
bytes and the error is not on the last data beat of the PCI transaction.
TABLE 19-4 PCI PIO Write AFAR
Field | Bits | Description | POR state | RW
Reserved | 63:41 | Reserved, read as 0. | 0 | R0
PA | 40:2 | Physical address of error transaction. | Undefined | R
0 | 1:0 | Always zero | 0 | R0
19.3.0.3
PCI Diagnostic Register
TABLE 19-5 PCI Diagnostic Register
Field | Bits | Description | POR state | RW
Reserved | 63:7 | Reserved, read as 0. | 0 | R0
DIS_RETRY | 6 | Disable retry limit. When set to 1, UltraSPARC-IIi does not abort PIO operations after 512 retries, but continues indefinitely. | 0 | RW
Reserved | 5:4 | Reserved. | 0 | R0
I_PIO_A_PAR | 3 | Invert PIO address parity. 0 = Correct parity asserted. 1 = Incorrect parity asserted for all PCI PIO address phases. | 0 | RW
I_PIO_D_PAR | 2 | Invert PIO data parity. 0 = Correct parity asserted. 1 = Incorrect parity asserted for all PCI PIO write data phases. | 0 | RW
I_DMA_D_PAR | 1 | Invert DMA data parity. 0 = Correct parity asserted. 1 = Incorrect parity asserted for all PCI DMA read data phases. | 0 | RW
LPBK_EN | 0 | Not supported. Read as 0 | 0 | R0
19.3.0.4
PCI Target Address Space Register
The PCI Target Address Space Register selectively enables 512 MByte regions as target PCI addresses for UltraSPARC-IIi.
TABLE 19-6 PCI Target Address Space Register
Field | Bits | Description | POR state | RW
Reserved | 63:8 | Reserved, read as 0. | 0 | R0
EF_enable | 7 | Respond to 0xE000.0000-0xFFFF.FFFF | 0 | RW
CD_enable | 6 | Respond to 0xC000.0000-0xDFFF.FFFF | 0 | RW
AB_enable | 5 | Respond to 0xA000.0000-0xBFFF.FFFF | 0 | RW
89_enable | 4 | Respond to 0x8000.0000-0x9FFF.FFFF | 0 | RW
67_enable | 3 | Respond to 0x6000.0000-0x7FFF.FFFF | 0 | RW
45_enable | 2 | Respond to 0x4000.0000-0x5FFF.FFFF | 0 | RW
23_enable | 1 | Respond to 0x2000.0000-0x3FFF.FFFF | 0 | RW
01_enable | 0 | Respond to 0x0000.0000-0x1FFF.FFFF | 0 | RW
UltraSPARC-IIi examines single-cycle PCI addresses and responds as a target if address[31:28] select an enabled region. Dual-cycle addresses are not selectively enabled as a target for UltraSPARC-IIi. Only address[63:50]==0x3FFF indicates that UltraSPARC-IIi is the target. Note that more than one region can be enabled, and holes are allowed. No other PCI device should be enabled to respond to the UltraSPARC-IIi target address space.
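Because each enable bit covers one 512 MByte region, the enable bit for a single-cycle (32-bit) PCI address is simply the top three address bits, as the ranges in TABLE 19-6 show. The sketch below models that lookup; the function names are hypothetical and dual-cycle addresses are deliberately not handled.

```python
def target_enable_bit(pci_addr):
    """Index of the enable bit covering a 32-bit PCI address (0 to 7)."""
    return (pci_addr & 0xFFFFFFFF) >> 29   # 2**29 bytes = 512 MByte regions


def responds_as_target(pci_addr, enable_reg):
    """True if the region containing pci_addr is enabled in enable_reg."""
    return bool((enable_reg >> target_enable_bit(pci_addr)) & 1)


# 0xA123.4567 lies in the 0xA000.0000-0xBFFF.FFFF region (AB_enable, bit 5).
print(target_enable_bit(0xA1234567))
```

Holes are permitted, so any combination of the eight bits is a valid value for enable_reg.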
19.3.0.5
PCI DMA Write Synchronization Register
Normally, interrupt delivery to the UltraSPARC-IIi core activates a Drain/Empty protocol to APB, to guarantee that any DMA writes received by APB prior to the interrupt arrival complete to memory. If another bus bridge exists behind APB, this procedure is insufficient. Software must execute a PIO load to the far side of that bus bridge, to flush any of its posted DMA writes to APB, and then do a read of this register to synchronize with the posted writes in APB.
TABLE 19-7 PCI DMA Write Synchronization Register
Field | Bits | Description | RW
Reserved | 63:0 | Reserved, read as 0. | R0
Completion of the load instruction (with load-use dependency or MEMBAR) signifies that synchronization is complete.
19.3.0.6
PIO Data Buffer Diagnostic Access
The PIO R/W Data Buffer Diagnostics Access provides direct PIO accesses to 8 entries of PIO data RAM.
TABLE 19-8 PIO Data Buffer Diagnostics Access
Field | Bits | Description | Type
Data | 63:0 | PIO read/write buffer data | RW
Note – Generally, usage must be a Write then a Read of a single entry. The Write
uses a PIO Data Buffer entry, so it is not possible to write all entries then read all entries.
19.3.0.7
DMA Data Buffer Diagnostic Access
The DMA Data Buffer Diagnostics Access provides direct PIO accesses to 8 entries of DMA data RAM.
TABLE 19-9 DMA Data Buffer Diagnostics Access
Field | Bits | Description | Type
Data | 63:0 | DMA read/write buffer data | RW
The (72:64) register is loaded as a side-effect of every read of one of the previous eight addresses. The data loaded is bits [72:64] of the relevant data buffer. On writes to the previous eight addresses, the contents of this register are used to write bits [72:64] of the relevant data buffer.
19.3.0.8
DMA Data Buffer Diagnostics Access
TABLE 19-10 DMA Data Buffer Diagnostics Access (72:64)
Field | Bits | Description | Type
Data | 63:8 | Reserved. Undefined data when read. | R
Data | 7:0 | DMA read/write buffer data | RW
19.3.1
PCI Configuration Space
The PBM contains a configuration header whose format is specified by the PCI Specification. The registers in the configuration header are accessed through PCI Configuration Address Space. The PBM is considered to be device 0 and function 0 on bus 0.
TABLE 19-11 PBM PCI Configuration Space
Register | PA
PBM Configuration Space (Bus 0, Device 0, Function 0) | 0x1FE.0100.0000 - 0x1FE.0100.00FF
Note – The PCI Configuration Address Space is little-endian. When accessing
configuration space registers, software should take advantage of one of the SPARC V9 little-endian support mechanisms to get proper byte ordering. These mechanisms include little-endian ASIs or MMU support for marking pages little-endian. A load or store instruction of the same size as the register, for example, a byte or a halfword, should always be used.
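The effect of byte ordering can be illustrated with the Vendor ID register, which reads as 0x108E. In little-endian configuration space the low byte is stored at the lowest address, so a plain big-endian halfword load of the same two bytes would yield a byte-swapped value. The byte string below is an illustrative stand-in for the two bytes at PA 0x1FE.0100.0000, not a capture from hardware.

```python
# Two bytes as they would sit at ascending addresses in configuration space.
raw_vendor_id = bytes([0x8E, 0x10])

# Interpreting the bytes with little-endian ordering gives the expected ID.
little = int.from_bytes(raw_vendor_id, "little")

# A big-endian interpretation of the same bytes is byte-swapped.
big = int.from_bytes(raw_vendor_id, "big")

print(hex(little), hex(big))
```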
The configuration header registers are defined by the PCI specification and PCI System Design Guide and are listed in TABLE 19-12. Some of the registers are not implemented in UltraSPARC-IIi – indicated by shading in the table. The rule used is that any optional register for which equivalent information exists elsewhere is not implemented.
TABLE 19-12 Configuration Space Header Summary
Register | PA[40:0] | Size
Required PCI Device Configuration Header:
Vendor ID | 0x1FE.0100.0000 | 2 bytes
Device ID | 0x1FE.0100.0002 | 2 bytes
Command | 0x1FE.0100.0004 | 2 bytes
Status | 0x1FE.0100.0006 | 2 bytes
Revision ID | 0x1FE.0100.0008 | 1 byte
Programming I/F Code | 0x1FE.0100.0009 | 1 byte
Sub-class Code | 0x1FE.0100.000A | 1 byte
Base Class Code | 0x1FE.0100.000B | 1 byte
Cache Line Size | 0x1FE.0100.000C | 1 byte
Latency Timer | 0x1FE.0100.000D | 1 byte
Header Type | 0x1FE.0100.000E | 1 byte
BIST | 0x1FE.0100.000F | 1 byte
Base Address | 0x1FE.0100.0010 - 0x1FE.0100.0027 | Varies
Reserved | 0x1FE.0100.0028 - 0x1FE.0100.002F | n/a
Expansion ROM | 0x1FE.0100.0030 | 4 bytes
Reserved | 0x1FE.0100.0034 - 0x1FE.0100.003B | n/a
Interrupt Line | 0x1FE.0100.003C | 1 byte
Interrupt Pin | 0x1FE.0100.003D | 1 byte
MIN_GNT | 0x1FE.0100.003E | 1 byte
MAX_LAT | 0x1FE.0100.003F | 1 byte
Optional Bridge Configuration Header:
Bus Number | 0x1FE.0100.0040 | 1 byte
Subordinate Bus Number | 0x1FE.0100.0041 | 1 byte
Reserved | 0x1FE.0100.0042 - 0x1FE.0100.00FF | n/a
Disconnect Counter | Unspecified | 1 byte
Bridge Command/Status | Unspecified | 4 bytes
Bridge Memory Base Address | Unspecified | 4 bytes
Bridge Memory Limit Address | Unspecified | 4 bytes
DOS Read Attributes | Unspecified | 2 bytes
DOS Write Attributes | Unspecified | 2 bytes
Bridge I/O Base Address | Unspecified | 2 bytes
Bridge I/O Limit Address | Unspecified | 2 bytes
Note – TABLE 19-12 lists the logical size for each register, but PIO access to the registers can be in any size from 1 to 8 bytes.
19.3.1.1
PCI Configuration Space Vendor ID
Read only; VendorID = 0x108E
19.3.1.2
PCI Configuration Space Device ID
Read only; DeviceID = 0xA000
Compatibility Note – This device ID is different from that of prior PCI-based
UltraSPARC systems.
19.3.1.3
PCI Configuration Space Command Register
TABLE 19-13 Command Register
Field | Bits | Description | POR state | RW
Reserved | 15:10 | Reserved, read as 0. | 0 | R0
FAST_EN | 9 | Enable fast back-to-back cycles to different targets. Hardwired to 0 (disabled). | 0 | R0
SERR_EN | 8 | Enable driving of SERR# pin. | 0 | RW
WAIT | 7 | Enable use of address/data stepping. Hardwired to 0 (disabled). | 0 | R0
PER | 6 | Enable reporting of parity errors | 0 | RW
VGA | 5 | Enable VGA palette snooping. Hardwired to 0 (disabled). | 0 | R0
MWI | 4 | Enables use of Memory Write & Invalidate. Hardwired to 0 (disabled). | 0 | R0
SPCL | 3 | Enables monitoring of special cycles. Hardwired to 0 (disabled). | 0 | R0
MSTR | 2 | Enables ability to be bus master. Hardwired to 1 (enabled). | 1 | R1
MEM | 1 | Enables response to PCI MEM cycles. Hardwired to 1 (enabled). | 1 | R1
IO | 0 | Enables response to PCI I/O cycles. Hardwired to 0 (disabled). | 0 | R0
19.3.1.4
PCI Configuration Space Status Register
TABLE 19-14 Status Register

Field  Bits  POR state  RW     Description
DPE    15    0          R/W1C  Set if PBM detects a parity error.
SSE    14    0          R/W1C  Set if PBM signalled a system error (detects address parity error).
RMA    13    0          R/W1C  Set if PBM receives a master-abort.
RTA    12    0          R/W1C  Set if PBM receives a target-abort.
STA    11    0          R/W1C  Set if PBM generates target-abort.
Chapter 19
UltraSPARC-IIi PCI Control and Status
303
TABLE 19-14 Status Register (Continued)

Field          Bits  POR state  RW     Description
DVSL           10:9  1          R01    Timing of DEVSEL#. Hardwired to 01 (medium speed response).
DPD            8     0          R/W1C  Set when parity error occurs while PBM is bus master, if PER in command register also set.
FASTCAP        7     1          R1     Indicates ability to accept fast back-to-back cycles as target, when the back-to-back transactions are not to the same target. Hardwired to 1 (allowed).
UDF_SUPPORT    6     0          R0     User Definable Feature Support. Hardwired to 0 (no user definable features).
66MHZ_CAPABLE  5     1          R1     Indicates ability to run at 66MHz clock speed. Hardwired to 1 (66MHz capable) for PBM.
Reserved       4:0   0          R0     Reserved, read as 0.
19.3.1.5
PCI Configuration Space Revision ID Register
Read only; RevisionID = 0x00; this register always reads as 0
19.3.1.6
PCI Configuration Space Programming I/F Code Register
Read only; ProgrammingIFCode = 0x00
19.3.1.7
PCI Configuration Space Sub-class Code Register
Read only; SubclassCode = 0x00 (specifies host bridge device)
19.3.1.8
PCI Configuration Space Base Class Code Register
Read only; BaseClassCode = 0x06 (specifies bridge device)
19.3.1.9
PCI Configuration Space Latency Timer Register
This 8-bit read/write register specifies the value of the latency timer for the PBM as a bus master. Only the top five bits are implemented, giving a timer granularity of 8 PCI clocks. The bottom three bits read as 0 and should be written as 0. The maximum PIO transfer is 64 bytes, so the latency timer may apply for transfers that insert many wait states to slow targets.
Compatibility Note – A value of 0 means there is no latency timeout.
TABLE 19-15 Latency Timer Register

Field       Bits  POR state  RW  Description
LAT_TMR_HI  7:3   0          RW  Programmable portion of latency timer.
LAT_TMR_LO  2:0   0          R0  Read only portion of latency timer. Hardwired to 0.
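The 8-clock granularity described above can be illustrated with a small sketch. The helper names are hypothetical and not part of any Sun software interface; the code only models which bits of a written value the register keeps.

```c
#include <assert.h>
#include <stdint.h>

/* Value the Latency Timer register would hold after writing `clocks`:
 * only bits 7:3 (LAT_TMR_HI) are implemented, bits 2:0 read as 0. */
static inline uint8_t lat_tmr_effective(uint8_t clocks)
{
    return (uint8_t)(clocks & 0xF8u);
}

/* Smallest programmable value that is at least `clocks`, capped at 0xF8.
 * A result of 0 means "no latency timeout" on UltraSPARC-IIi. */
static inline uint8_t lat_tmr_round_up(uint8_t clocks)
{
    unsigned v = ((unsigned)clocks + 7u) & ~7u;
    return (uint8_t)(v > 0xF8u ? 0xF8u : v);
}
```

Rounding up rather than truncating is one possible policy when a minimum bus tenure is wanted; truncation is what the hardware itself does to the low three bits.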
19.3.1.10
PCI Configuration Space Header Type Register
TABLE 19-16 Header Type Register

Field       Bits  RW  Description
MULTI_FUNC  7     R0  Indicates whether the PBM is a multi-function PCI device. Hardwired to 0 (not multi-function).
HDR_TYPE    6:0   R0  Defines layout of configuration header bytes 0x10-0x3F. Hardwired to 0 (the only defined value in PCI specification).
19.3.1.11
PCI Configuration Space Bus Number
This 8-bit read/write register specifies the number of the PCI bus on which this bridge is found. Although programmable, it is not used. UltraSPARC-IIi always assumes it is on bus 0 when decoding a PIO PA to determine whether to create Type 0 or Type 1 configuration cycles.
TABLE 19-17 Bus Number Register

Field  Bits  POR state  RW  Description
BUS    7:0   0          RW  Bus number
19.3.1.12
PCI Configuration Space Subordinate Bus Number
This 8-bit read/write register specifies the highest subordinate bus number beneath this bridge. Although programmable, it has no effect on UltraSPARC-IIi.
TABLE 19-18 Subordinate Bus Number Register

Field    Bits  POR state  RW  Description
SUB_BUS  7:0   0          RW  Highest subordinate bus number
19.3.1.13
PCI Configuration Space Unimplemented Registers
The following registers are defined in the PCI Specification or PCI System Design Guide, but are not implemented in UltraSPARC-IIi’s PBM for the indicated reasons.
Cache Line Size: The cache line size is fixed at 64 bytes.
BIST: Built-In-Self-Test is not implemented in UltraSPARC-IIi.
Base Address Registers: The bridge has neither memory nor I/O space. Its configuration space is accessible only from the host and is hard-mapped.
Interrupt Line, Interrupt Pin: Do not apply; interrupt lines are handled by the RIC ASIC.
Min_Gnt, Max_Lat: There is no regular traffic pattern to programmed I/O. Values of zero (true) indicate there are no stringent requirements.
19.3.2
IOMMU Registers
TABLE 19-19 IOMMU Registers

Register                        Offset                             Access Size
IOMMU Control Register          0x1FE.0000.0200                    8 bytes
IOMMU TSB Base Address Reg.     0x1FE.0000.0208                    8 bytes
IOMMU Flush Register            0x1FE.0000.0210                    8 bytes
IOMMU Virtual Addr. Diag. Reg.  0x1FE.0000.A400                    8 bytes
IOMMU Tag Compare Diag.         0x1FE.0000.A408                    8 bytes
IOMMU LRU Queue Diag.           0x1FE.0000.A500-0x1FE.0000.A57F    8 bytes
IOMMU Tag Diag.                 0x1FE.0000.A580-0x1FE.0000.A5FF    8 bytes
IOMMU Data RAM Diag.            0x1FE.0000.A600-0x1FE.0000.A67F    8 bytes
19.3.2.1
IOMMU Control Register
The Control Register affects diagnostic mode, IOMMU TSB size and page size.
TABLE 19-20 IOMMU Control Register

Field        Bits   POR state  Type   Description
RESERVED     63:27  0          R0     Reserved, read as zeros.
ERRSTS       26:25  0          R/W1C  If ERR is set, indicates the type of error logged in the IOMMU state.
ERR          24     0          R/W1C  Set when IOMMU is written with an ERR.
LRU_LCKEN    23     0          RW     LRU Lock Enable Bit. When set, only the IOMMU entry specified by the Lock Pointer can be replaced.
LRU_LCKPTR   22:19  RW         RW     LRU Lock Pointer. Works in conjunction with the LRU Lock Enable bit to limit IOMMU replacement to a single entry.
TSB_SIZE     18:16  0          RW     IOMMU TSB table size. Number of 8 byte entries: 0=1K, 1=2K, 2=4K, 3=8K, 4=16K, 5=32K, 6=64K, 7=128K.
RESERVED     15:3   0          R0     Reserved, read as zeros.
TBW_SIZE(1)  2      0          RW     Assumed page size during IOMMU TSB lookup. 0 = 8K page, 1 = 64K page.
MMU_DE       1      0          RW     Diagnostic mode enable; when set, it enables the diagnostic mode. See description of IOMMU tag diagnostics.
MMU_EN       0      0          RW     IOMMU enable bit; when set, it enables the translation.

1. If DMA mappings are always 8K pages, or mixed 8K and 64K pages, set this bit to ‘0’ so that the index is constructed for 8K lookup. If all DMA mappings are to 64K pages, set this bit to ‘1’ so that the index is based on 64K pages. When this bit is ‘0’, a 64K mapping should be placed in all eight TSB entries in which it is indexed.
Compatibility Note – ERR and ERRSTS are not present in prior PCI-based
UltraSPARC systems.
TABLE 19-21 Address Space Size And Base Address Determination

           TBW_SIZE == 0               TBW_SIZE == 1
TSB_SIZE   VA Space Size  TSB Index    VA Space Size   TSB Index
0          8 MB           VA[22:13]    64 MB           VA[25:16]
1          16 MB          VA[23:13]    128 MB          VA[26:16]
2          32 MB          VA[24:13]    256 MB          VA[27:16]
3          64 MB          VA[25:13]    512 MB          VA[28:16]
4          128 MB         VA[26:13]    1 GB            VA[29:16]
5          256 MB         VA[27:13]    2 GB            VA[30:16]
6          512 MB         VA[28:13]    not allowed(1)  ---
7          1 GB           VA[29:13]    not allowed(1)  ---

1. Hardware does not prevent illegal combinations from being programmed. If an illegal combination is programmed into the IOMMU, all translation requests will be rejected as invalid.
Address space size and TSB offset are affected by TSB_SIZE and TBW_SIZE as shown in TABLE 19-21.
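As a rough illustration of how TSB_SIZE and TBW_SIZE interact, the following sketch computes the DMA virtual address space size and a TSB index from the two fields. It assumes the index is simply the virtual page number truncated to the number of TSB entries, which is consistent with the sizes in TABLE 19-21; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Page shift implied by TBW_SIZE: 0 = 8K pages, 1 = 64K pages. */
static inline unsigned iommu_page_shift(unsigned tbw_size)
{
    return tbw_size ? 16u : 13u;
}

/* Number of 8-byte TSB entries for TSB_SIZE 0..7 (1K .. 128K). */
static inline uint64_t iommu_tsb_entries(unsigned tsb_size)
{
    assert(tsb_size <= 7);
    return (uint64_t)1024 << tsb_size;
}

/* TSB_SIZE 6 and 7 are not allowed together with TBW_SIZE == 1. */
static inline int iommu_combo_legal(unsigned tsb_size, unsigned tbw_size)
{
    return tsb_size <= 7 && !(tbw_size && tsb_size >= 6);
}

/* Size in bytes of the virtual address space covered by the TSB. */
static inline uint64_t iommu_va_space_bytes(unsigned tsb_size, unsigned tbw_size)
{
    assert(iommu_combo_legal(tsb_size, tbw_size));
    return iommu_tsb_entries(tsb_size) << iommu_page_shift(tbw_size);
}

/* Assumed index derivation: page number modulo the number of entries. */
static inline uint64_t iommu_tsb_index(uint32_t va, unsigned tsb_size,
                                       unsigned tbw_size)
{
    assert(iommu_combo_legal(tsb_size, tbw_size));
    return ((uint64_t)va >> iommu_page_shift(tbw_size)) &
           (iommu_tsb_entries(tsb_size) - 1);
}
```

Checking `iommu_combo_legal()` before programming the Control Register is worthwhile because, as the table footnote says, hardware does not reject illegal combinations.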
IOMMU locking
For diagnostics and debugging, the IOMMU can restrict replacement to a single IOMMU entry. This is controlled by the LRU_LCKEN and LRU_LCKPTR fields of the IOMMU Control Register. To turn locking on properly, the following sequence is required:
• Set MMU_EN to 0
• Set LRU_LCKEN to 1 (must be a separate PIO write)
• Set LRU_LCKPTR to desired value (may be combined with previous PIO)
• Set MMU_DE to 1 (may be combined with previous PIO)
• Invalidate all IOMMU entries
• Set MMU_EN to 1 and MMU_DE to 0.
To unlock the IOMMU:
• Set LRU_LCKEN to 0
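The locking sequence above can be sketched in C as follows. This is only an illustration under stated assumptions: `ctrl` stands for a mapping of the IOMMU Control Register (a plain variable works for simulation), the invalidation step is supplied by the caller through a hypothetical callback, and the R/W1C bits ERR and ERRSTS are masked out of every write so that the sequence does not clear a logged error as a side effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOMMU_CTRL_MMU_EN      (1ull << 0)
#define IOMMU_CTRL_MMU_DE      (1ull << 1)
#define IOMMU_CTRL_LCKPTR_MASK (0xFull << 19)   /* LRU_LCKPTR, bits 22:19 */
#define IOMMU_CTRL_LRU_LCKEN   (1ull << 23)
#define IOMMU_CTRL_W1C_MASK    (0x7ull << 24)   /* ERR and ERRSTS */

/* Hypothetical callback that invalidates all 16 IOMMU entries. */
typedef void (*iommu_invalidate_fn)(void *ctx);

static inline void iommu_lock_entry(volatile uint64_t *ctrl, unsigned entry,
                                    iommu_invalidate_fn invalidate_all,
                                    void *ctx)
{
    uint64_t v;

    assert(entry < 16);

    /* Step 1: MMU_EN = 0, in a PIO write of its own. */
    v = *ctrl & ~(IOMMU_CTRL_W1C_MASK | IOMMU_CTRL_MMU_EN);
    *ctrl = v;

    /* Steps 2-4: LRU_LCKEN = 1, LRU_LCKPTR = entry, MMU_DE = 1.
     * The text allows these three to share one PIO write. */
    v &= ~IOMMU_CTRL_LCKPTR_MASK;
    v |= IOMMU_CTRL_LRU_LCKEN | ((uint64_t)entry << 19) | IOMMU_CTRL_MMU_DE;
    *ctrl = v;

    /* Step 5: invalidate all IOMMU entries (caller-provided). */
    if (invalidate_all != NULL)
        invalidate_all(ctx);

    /* Step 6: MMU_EN = 1 and MMU_DE = 0. */
    v = (v | IOMMU_CTRL_MMU_EN) & ~IOMMU_CTRL_MMU_DE;
    *ctrl = v;
}

static inline void iommu_unlock(volatile uint64_t *ctrl)
{
    *ctrl = *ctrl & ~(IOMMU_CTRL_W1C_MASK | IOMMU_CTRL_LRU_LCKEN);
}
```

Passing `NULL` for the callback skips the invalidation step, which is only appropriate when the caller has already invalidated the entries by other means.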
19.3.2.2
IOMMU TSB Base Address Register
The IOMMU TSB Base Address Register contains the pointer to the first-entry of the IOMMU TSB table. Together with part of the virtual address it uniquely identifies the address from which hardware should fetch the TTE from the IOMMU TSB table. The IOMMU TSB table has to be aligned on an 8K boundary. The lower order 13 bits are assumed to be 0x0 during IOMMU TSB table lookup. Tables larger than 8K bytes are only constrained to be on 8K boundaries rather than having to be size aligned.
TABLE 19-22 IOMMU TSB Base Address Register

Field     Bits   Type  Description
RESERVED  63:41  R0    Reserved, read as zeros.
ZERO      40:34  R0    Bits 40:34 of the TSB physical address are always zero.
TSB_BASE  33:13  RW    Bits [33:13] of the TSB physical address. 33:30 should always be zero, since only 1-Gbyte of physical memory is supported.
RESERVED  12:0   R0    Reserved, read as zeros.
19.3.2.3
Flush Address Register
This is a write-only pseudo-register that allows software to perform an address-based flush of a mapping from the IOMMU. The data written to this address contains the page number to be flushed. An IOMMU entry with a matching page number is invalidated.
TABLE 19-23 Flush Address Register

Field      Bits   Type  Description
RESERVED   63:32  W     Reserved, write has no effect.
FLUSH_VPN  31:13  W     31:16 = virtual page number if 64K page; bits 15:13 are don’t care. 31:13 = virtual page number if 8K page.
RESERVED   12:0   W     Reserved, write has no effect.
Note – No hardware mechanisms exist to solve the potential race between a DMA translation needing an IOMMU entry and the write to the Flush Address Register intended to flush that entry. Software must manage the interlock by guaranteeing that no DMA transfers can involve the page being flushed.
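A minimal sketch of composing the value written to the Flush Address Register is shown below. The names are hypothetical, and `flush_reg` stands for however the register at offset 0x1FE.0000.0210 is mapped; the caller remains responsible for the software interlock described in the note.

```c
#include <assert.h>
#include <stdint.h>

/* FLUSH_VPN occupies bits 31:13. Bits 12:0 have no effect, and for a
 * 64K page bits 15:13 are don't care, so masking the low 13 bits of the
 * DMA virtual address is sufficient for both page sizes. */
static inline uint64_t iommu_flush_value(uint32_t dvma)
{
    return (uint64_t)(dvma & 0xFFFFE000u);
}

/* Issue the flush through a caller-supplied register mapping. */
static inline void iommu_flush_page(volatile uint64_t *flush_reg, uint32_t dvma)
{
    *flush_reg = iommu_flush_value(dvma);
}
```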
19.3.2.4
IOMMU TAG Diagnostics Access
The IOMMU Tag Diagnostics Access provides a diagnostics path to the 16-entry IOMMU Tag when the MMU_DE bit in the IOMMU Control Register is turned on.
TABLE 19-24 IOMMU Tag Diagnostics Access

Field     Bits   Type  Description
RESERVED  63:25  R0    Reserved, read as zeros.
ERRSTS    24:23  RW    Error Status: 00 = Reserved, 01 = Invalid Error, 10 = Reserved, 11 = UE Error on TTE read.
ERR       22     RW    When set to 1, indicates that there is an error associated with this IOMMU entry. The specific error is indicated by the ERRSTS field.
W         21     RW    Writable bit. When set, the page mapped by the IOMMU has write permission granted.
S         20     RW    Stream bit (unused).
SIZE      19     RW    Page Size, 0=8K and 1=64K.
VPN       18:0   RW    VPN[31:13]
Note – Diagnostic accesses should ensure that multiple match conditions are not
generated. The result of multiple matches is unpredictable.
Compatibility Note – Unlike prior PCI-based UltraSPARC systems, UltraSPARC-IIi arbitrates between IOMMU CSR access and DMA access. This property may allow software more flexibility.
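To make the tag layout in TABLE 19-24 concrete, here is a small decoding sketch. The structure and function names are illustrative assumptions; the bit positions are taken from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of one IOMMU Tag diagnostic word. */
struct iommu_tag {
    unsigned errsts;    /* bits 24:23; 01 = Invalid, 11 = UE on TTE read */
    unsigned err;       /* bit 22 */
    unsigned writable;  /* bit 21 (W) */
    unsigned stream;    /* bit 20 (S, unused) */
    unsigned size_64k;  /* bit 19; 0 = 8K page, 1 = 64K page */
    uint32_t va_base;   /* VPN[31:13] placed back at bits 31:13 */
};

static inline struct iommu_tag iommu_tag_decode(uint64_t raw)
{
    struct iommu_tag t;

    t.errsts   = (unsigned)((raw >> 23) & 0x3u);
    t.err      = (unsigned)((raw >> 22) & 0x1u);
    t.writable = (unsigned)((raw >> 21) & 0x1u);
    t.stream   = (unsigned)((raw >> 20) & 0x1u);
    t.size_64k = (unsigned)((raw >> 19) & 0x1u);
    t.va_base  = (uint32_t)(raw & 0x7FFFFu) << 13;
    return t;
}
```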
19.3.2.5
IOMMU Data RAM Diagnostic Access
The IOMMU Data Diagnostics Access provides direct PIO accesses to 16 entries of IOMMU Data RAM. The MMU_DE bit in the IOMMU Control Register must be turned on to perform the accesses. TABLE 19-25 shows the information included in the returned data.
TABLE 19-25 IOMMU Data RAM Diagnostics Access

Field      Bits   Type  Description
RESERVED   63:31  R0    Reserved, read as zeros.
V          30     RW    Valid bit; when set, the TLB data field is meaningful.
U          29     RW    Used bit. Affects the LRU replacement.
C          28     RW    Cacheable bit. 1=Cacheable access, 0=Noncacheable.
PA[40:34]  27:21  R     Not stored. All 1’s if Noncacheable, all 0’s if Cacheable.
PA[33:13]  20:0   RW    21-bit Physical Page Number
Compatibility Note – The Used bit does not exist in prior PCI-based UltraSPARC
systems, and is used by the pseudo-LRU replacement algorithm.
19.3.2.6
Virtual Address Diagnostic Register
This register is used to set up the virtual address for the IOMMU compare diagnostic. The virtual address is written to this register and enables the compare results to be read from the IOMMU.
TABLE 19-26 Virtual Address Diagnostic Register

Field     Bits   Type  Description
RESERVED  63:32  R0    Reserved, read as 0.
VPN       31:13  R/W   Virtual page number.
RESERVED  12:0   R0    Reserved, read as 0.
19.3.2.7
IOMMU Tag Compare Diagnostic Access
TABLE 19-27 IOMMU Tag Comparator Diagnostics Access

Field     Bits   Type  Description
RESERVED  63:16  R0    Reserved, read as zeros.
COMP      15:0   R     IOMMU tag comparator output for each entry.
Note – The IOMMU Tag Compare Diagnostics Access provides the diagnostics path
to the 16-entry IOMMU Tag Comparator when the MMU_DE bit in the IOMMU Control Register is turned on. Bit 0 represents the comparison result of the first IOMMU Tag entry, and bit 15 represents the last.
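Since a multiple-match condition has unpredictable results, diagnostic software may want to inspect the COMP field before relying on it. The following sketch, with hypothetical helper names, counts the matching entries and finds the first one.

```c
#include <assert.h>
#include <stdint.h>

/* Number of IOMMU tag entries whose comparator output is set
 * (COMP, bits 15:0; bit 0 is the first entry, bit 15 the last). */
static inline int iommu_comp_match_count(uint64_t raw)
{
    uint32_t comp = (uint32_t)(raw & 0xFFFFu);
    int n = 0;

    while (comp != 0) {
        n += (int)(comp & 1u);
        comp >>= 1;
    }
    return n;
}

/* Index of the lowest matching entry, or -1 if nothing matched. */
static inline int iommu_comp_first_match(uint64_t raw)
{
    int i;

    for (i = 0; i < 16; i++) {
        if ((raw >> i) & 1u)
            return i;
    }
    return -1;
}
```

A count greater than 1 indicates that the tags were set up with overlapping entries.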
19.3.3
Interrupt Registers
Interrupts load the Interrupt Vector Data registers with the data shown in FIGURE 19-1. See Section 11.10.4, “Incoming Interrupt Vector Data” on page 122.
FIGURE 19-1 Interrupt Vector Data Registers Contents

Interrupt Rcv Data   Bits 63:11   Bits 10:0
0                    0            INR
1                    0            0
2                    0            0
INR is an 11 bit interrupt number that indicates the source of the interrupt. Where possible, the interrupt is precise (that is, it points to only one interrupt source). This singularity permits the dispatch of the proper interrupt service routine without any register polling. Bits [11] through [63] of the first word are guaranteed to be 0 for all UltraSPARC-IIi IO generated interrupts. Words 1 and 2 of the interrupt packet are also guaranteed to be 0.
Each interrupt source has a mapping register, containing the INR value used for the interrupt. The INR has two parts: IGN and INO. The Interrupt Group Number (IGN) is the upper 5 bits of the INR, and for most interrupts is 0x1f.
Compatibility Note – The IGN on UltraSPARC-IIi is not programmable for the Partial Interrupt Mapping Registers, and is fixed to 0x1f.
The lower 6 bits of the INR are the Interrupt Number Offset (INO). This value is hardcoded by UltraSPARC-IIi for each interrupt source, as shown in TABLE 19-28, and is read-only in the mapping register. For PCI slot interrupt mapping registers, INO is always read as 00. For Graphics (FFB) and UPA64S expansion interrupts, the full 11-bit INR field is writable, and under software control.
TABLE 19-28 Interrupt Number Offset Assignments

INO (binary)  INO (hex)  Interrupt Source
0bssnn        00-1F      PCI Bus b Slot ss Interrupt nn
                         b = 0 for bus A, 1 for bus B
                         ss = 00-11 for bus A or B slots
                         nn = 00-11 for INTA#, INTB#, INTC#, INTD#
100000        20         SCSI
100001        21         Ethernet
100010        22         Parallel port
100011        23         Audio Record
100100        24         Audio Playback
100101        25         Power Fail
100110        26         Keyboard/mouse/serial
100111        27         Floppy
101000        28         Reserved (spare HW int)
101001        29         Keyboard
101010        2A         Mouse
101011        2B         Serial
101100        2C         Reserved
101101        2D         Reserved
101110        2E         DMA UE
101111        2F         DMA CE
110000        30         PCI Bus Error
110001        31         Reserved
110010        32         Reserved
111111        3F         Reserved
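The split of the INR into IGN and INO, and of a PCI slot INO into its bus, slot and pin fields, can be sketched as follows. All names are hypothetical; the field widths come from the text and from the 0bssnn pattern in TABLE 19-28.

```c
#include <assert.h>

/* IGN is the upper 5 bits of the 11-bit INR, INO the lower 6 bits. */
static inline unsigned inr_ign(unsigned inr) { return (inr >> 6) & 0x1Fu; }
static inline unsigned inr_ino(unsigned inr) { return inr & 0x3Fu; }

/* INO values 0x00-0x1F (pattern 0bssnn) are PCI slot interrupts. */
static inline int ino_is_pci_slot(unsigned ino) { return ino <= 0x1Fu; }

struct pci_ino {
    unsigned bus;   /* b:  0 = bus A, 1 = bus B */
    unsigned slot;  /* ss: slot 0-3 */
    unsigned pin;   /* nn: 0 = INTA#, 1 = INTB#, 2 = INTC#, 3 = INTD# */
};

static inline struct pci_ino pci_ino_decode(unsigned ino)
{
    struct pci_ino p;

    assert(ino_is_pci_slot(ino));
    p.bus  = (ino >> 4) & 0x1u;
    p.slot = (ino >> 2) & 0x3u;
    p.pin  = ino & 0x3u;
    return p;
}
```

For example, an INR of 0x7E0 carries IGN 0x1f and INO 0x20, which TABLE 19-28 assigns to SCSI.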
Each interrupt source has an associated state register that can be either of type “level” or of type “pulse.” In the level sensitive case, the state register has two bits and there are three valid states: IDLE, RECEIVED, and PENDING.
• IDLE: No interrupt in progress.
• RECEIVED: An interrupt has been detected and will be delivered to the processor if the valid bit is set in the mapping register.
• PENDING: Interrupt has been delivered to the UltraSPARC-IIi core. Any subsequent detection of the same interrupt is ignored until software resets the state machine back to IDLE.
Software can set the state register for each level sensitive interrupt to any of these states using the Clear Interrupt Registers.
In the pulse case, the state register consists of a single bit, with two states: IDLE and RECEIVED. These states have the same meaning as those for the level sensitive case. There is no PENDING state, so the state machine transitions from RECEIVED back to IDLE when the interrupt is dispatched to a processor. Diagnostic access is provided to allow software to read the state register for all interrupt sources.
Compatibility Note – There is no RECEIVED state for DMA CE, DMA UE, or PCI Error Interrupts. They cause their interrupt FSMs to go from the IDLE to the PENDING state directly, when present and enabled.
19.3.3.1
Partial Interrupt Mapping Registers
The offset of each partial Interrupt Mapping Register can be derived from the associated INO. There are two cases:
PCI Interrupts: IMR address = 0x1FE.0000.0C00 + (INO & 0x3C)
00 - IDLE state; no interrupt received or pending.
01 - RECEIVED state; interrupt detected, but not dispatched.
11 - PENDING state; interrupt is received and dispatched.
10 - Illegal state.
TABLE 19-37 Pulse Interrupt State Assignment

Field      Description
INT_STATE  0 - IDLE state; no interrupt received.
           1 - RECEIVED state; interrupt detected, but not dispatched.
Definitions of the registers are shown in a general way in the table below. Refer to the CODE EXAMPLE 19-1 above for specific bit positions. As an example, the bit position for PCI Bus B Slot 1, INTB# is ..
TABLE 19-38 PCI Interrupt State Diagnostic Register Definition

Bits   Description
7:0    PCI Bus A Slot 0 INT# DCBA
15:8   PCI Bus A Slot 1 INT# DCBA
23:16  PCI Bus A Slot 2 INT# DCBA
31:24  PCI Bus A Slot 3 INT# DCBA
39:32  PCI Bus B Slot 0 INT# DCBA
47:40  PCI Bus B Slot 1 INT# DCBA
55:48  PCI Bus B Slot 2 INT# DCBA
63:56  PCI Bus B Slot 3 INT# DCBA
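The layout in TABLE 19-38 can be turned into a lookup helper along the following lines. This is a sketch under one assumption that the table implies but does not state outright: within each slot's byte, "DCBA" is read from the most significant pair of bits down, so INTA# occupies the lowest two bits and INTD# the highest two. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Level interrupt state encodings: 2-bit state per interrupt. */
enum {
    INT_STATE_IDLE     = 0x0,
    INT_STATE_RECEIVED = 0x1,
    INT_STATE_PENDING  = 0x3
};

/* Extract the 2-bit state for one PCI interrupt line.
 * bus: 0 = A, 1 = B; slot: 0-3; pin: 0 = INTA# ... 3 = INTD#. */
static inline unsigned pci_int_state(uint64_t diag, unsigned bus,
                                     unsigned slot, unsigned pin)
{
    unsigned shift;

    assert(bus < 2 && slot < 4 && pin < 4);
    shift = (bus * 4u + slot) * 8u + pin * 2u;
    return (unsigned)((diag >> shift) & 0x3u);
}
```

With this assumption, PCI Bus A Slot 2 INTC# lands at bits 21:20 of the register.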
TABLE 19-39 OBIO and Misc Int Diag Reg Definition

Bits   Description
1:0    SCSI Int State
3:2    Ethernet Int State
5:4    Parallel Port Int State
7:6    Audio Record Int State
9:8    Audio Playback Int State
11:10  Power Fail Int State
13:12  Kbd/mouse/serial Int State
15:14  Floppy Int State
17:16  Spare HW Int State
19:18  Keyboard Int State
21:20  Mouse Int State
23:22  Serial Int State
29:28  DMA UE Int State
31:30  DMA CE Int State
33:32  PCI Error Int State
35:34  Reserved (return 0 on read)
37:36  Reserved (return 0 on read)
34     Graphics Int State
35     Expansion UPA64S Int State
63:36  Reserved (return 0 on read)
Compatibility Note – Note that the “Graphics Int State” and “Expansion UPA64S Int State” bits are moved from bits 38 and 39 (position in prior UltraSPARC systems) to bits 34 and 35 respectively.
19.3.4
PCI INT_ACK Generation
UltraSPARC-IIi can generate an interrupt acknowledge in response to a PCI Interrupt. Name: ASI_INT_ACK (Privileged)
ASI: 0x7F, VA==0x1FF, VA== (any address to PCI)
PCI INT_ACK Register Format
TABLE 19-40
Bits
Field
DATA INT_ACK data from PCI
Use
RW R
BUSY: This bit is set when an interrupt vector is received.
DATA: Data returned on PCI byte 0 during INT_ACK cycle.
Non-privileged access to this register causes a privileged_action trap. The address generated on the PCI bus is equal to VA[31:0].
VA[23:21] should be set to specific values when the APB MAP_INTACK_A/B functions are enabled, to control the forwarding of the INT_ACK to the A or B bus. The particular VA[23:21] depends on the way IO space is divided, since the same mapping register is used in APB for IO space and for MAP_INTACK_A/B forwarding. VA[23:21] are don't care if the APB ROUTE_INTACK_A/B functions are used to hardwire the INT_ACK forwarding. All other bits, VA[31:24] and VA[20:0], can be random values; zeros are recommended.
If software does anything other than a byte/halfword/word load with ASI_INT_ACK, UltraSPARC-IIi/APB operation is undefined. A byte load should be correct for most systems. All error logging and events for PCI loads apply equally to this INT_ACK cycle generated by UltraSPARC-IIi.
19.4
PCI Address Space
PCI devices can be connected directly to the UltraSPARC-IIi PCI bus. UltraSPARC-IIi can also be used with an external PCI bridge, the Advanced PCI Bridge (APB), that can connect to separate PCI A and PCI B PCI buses. UltraSPARC-IIi support of multiple PCI buses includes interrupt management and flexible address mapping.
APB provides a generalized address decode facility and a flexible target address space definition for DMA. PCI A and PCI B can each support four PCI devices. There are no separate UltraSPARC-IIi CSRs for the A and B buses created by APB; there is only the single set of CSRs for the PCI bus connected to UltraSPARC-IIi.
19.4.1
PCI Address Space—PIO
Several regions of UltraSPARC-IIi’s physical address space are used to access devices on the PCI bus that it supports.
For the non-block transfers, any legal combination of bits in the bytemask may be set (that is, arbitrary bytemasks for writes, aligned 1, 2, 4, 8 or 16 byte bytemasks for reads), within the size restrictions listed below. The PCI byte enables generated by UltraSPARC-IIi are identical to those generated by the UltraSPARC core.
The PCI specification, version 2.1, requires AD[1:0] to point to the first byte enable for I/O writes. This requirement is not met by UltraSPARC-IIi during:
• compression of byte or halfword stores (Ebit==0) or
• use of the PSTORE instruction to generate random byte enables.
Generally, software should use only normal, non-compressed loads and stores to I/O space, and UltraSPARC-IIi meets the AD[1:0] requirement for those instructions. Also note that UltraSPARC-IIi can generate multiple data beat Configuration Reads or Writes.
TABLE 19-41 Physical Address Space to PCI Space Mappings

PCI Address Space        PA[40:0]                         CPU Commands Supported  PCI Commands Generated
PCI Configuration Space  0x1FE.0100.0000-0x1FE.01FF.FFFF  NC read (any)           Configuration Read
                                                          NC write (any)          Configuration Write (may also be Special Cycle)
PCI Bus I/O Space        0x1FE.0200.0000-0x1FE.02FF.FFFF  NC read (any)           I/O Read
                                                          NC write (any)          I/O Write
Do not use               0x1FE.0300.0000-0x1FE.FFFF.FFFF                          May wrap to Configuration or I/O Space behavior
PCI Bus Memory Space     0x1FF.0000.0000-0x1FF.FFFF.FFFF  NC read (4 byte)        Memory Read
                                                          NC read (8 byte)        Memory Read Multiple
                                                          NC Block read           Memory Read Line
                                                          NC write                Memory Write
                                                          NC Block write          Memory Write
                                                          NC Instruction fetch    Memory Read
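The address ranges in TABLE 19-41 can be checked in software with a simple classifier such as the one below. The enum and function names are illustrative only; the boundaries are the ones listed in the table, with the dotted notation written as plain 41-bit constants.

```c
#include <assert.h>
#include <stdint.h>

enum pci_pio_space {
    PCI_PIO_NOT_PCI,     /* below 0x1FE.0100.0000, for example CSR space */
    PCI_PIO_CONFIG,      /* 0x1FE.0100.0000 - 0x1FE.01FF.FFFF */
    PCI_PIO_IO,          /* 0x1FE.0200.0000 - 0x1FE.02FF.FFFF */
    PCI_PIO_DO_NOT_USE,  /* 0x1FE.0300.0000 - 0x1FE.FFFF.FFFF */
    PCI_PIO_MEMORY       /* 0x1FF.0000.0000 - 0x1FF.FFFF.FFFF */
};

static inline enum pci_pio_space pci_pio_classify(uint64_t pa)
{
    assert(pa <= 0x1FFFFFFFFFFull);   /* physical addresses are 41 bits */

    if (pa >= 0x1FF00000000ull)
        return PCI_PIO_MEMORY;
    if (pa >= 0x1FE03000000ull)
        return PCI_PIO_DO_NOT_USE;
    if (pa >= 0x1FE02000000ull)
        return PCI_PIO_IO;
    if (pa >= 0x1FE01000000ull)
        return PCI_PIO_CONFIG;
    return PCI_PIO_NOT_PCI;
}
```

A driver framework could use such a check to refuse mappings that fall into the "Do not use" range, where accesses may wrap to Configuration or I/O Space behavior.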
Note – All PCI address spaces use little-endian address byte ordering. Any accesses made to a PCI address space should use one of the SPARC V9 little-endian support mechanisms to get proper byte ordering. These mechanisms include little-endian ASIs or MMU support for marking pages little-endian.
19.4.1.1
PCI Configuration Space
PCI configuration cycles can be generated by UltraSPARC-IIi in response to PIO reads and writes to addresses in the PCI Configuration Space. UltraSPARC-IIi generates both Type 0 and Type 1 configuration cycles. Type 0 configuration cycles are used to configure devices on the UltraSPARC-IIi primary PCI bus, including APB. Type 1 configuration cycles are used to configure devices on secondary PCI busses via APB. UltraSPARC-IIi does not implement either of the two means of generating PCI configuration cycles defined by the PCI Specification but instead uses the following means:
An UltraSPARC-IIi PIO causes a type 0 configuration cycle on the primary PCI bus if PA[32:24] equals 0x001 and PA[23:16] (Bus Number) equals 0, and the Device Number is not 0. A Device Number of 0 designates the PBM itself, and the configuration cycle does not appear on the PCI bus.
FIGURE 19-2 shows how address bits 15:0 map to the PCI configuration cycle address.
FIGURE 19-2 Type 0 Configuration Address Mapping

Configuration Space Address
Bits   Field
32:24  000000001
23:16  Bus Number
15:11  Device Number
10:8   Function Number
7:2    Register Number
1:0    00

PCI Configuration Cycle Address
Bits   Field
31:11  2^(Device Number) (only one ‘1’)
10:8   Function Number
7:2    Register Number
1:0    01
The UltraSPARC-IIi PCI bus has no IDSEL# pins so device IDSEL# lines must be resistively tied to individual AD[31:11] lines. It is recommended that slot 0 be device 1, tied to AD[12]; slot 1 be device 2; tied to AD[13], and so on.
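The configuration space address layout and the recommended IDSEL# wiring can be sketched as below. The function names are hypothetical. The IDSEL# helper assumes the recommended wiring, in which device n is tied to AD[11 + n], so only devices 1 through 20 map to a line within AD[31:11].

```c
#include <assert.h>
#include <stdint.h>

/* Base of PCI Configuration Space: 0x1FE.0100.0000 (PA[32:24] == 0x001). */
#define PCI_CFG_BASE 0x1FE01000000ull

/* Build the physical address of a configuration register. */
static inline uint64_t pci_cfg_pa(unsigned bus, unsigned dev,
                                  unsigned func, unsigned reg)
{
    assert(bus < 256 && dev < 32 && func < 8 && reg < 256);
    return PCI_CFG_BASE | ((uint64_t)bus << 16) | ((uint64_t)dev << 11) |
           ((uint64_t)func << 8) | (reg & 0xFCu);
}

/* Bus Number 0 selects a Type 0 cycle (or the PBM itself when the
 * Device Number is 0); any other bus number selects Type 1. */
static inline int pci_cfg_is_type0(uint64_t pa)
{
    return ((pa >> 16) & 0xFFu) == 0;
}

/* One-hot AD line that reaches the IDSEL# of device `dev`,
 * assuming the recommended wiring (device 1 on AD[12]). */
static inline uint32_t pci_cfg_idsel_line(unsigned dev)
{
    assert(dev >= 1 && dev <= 20);
    return (uint32_t)1u << (11u + dev);
}
```

For instance, `pci_cfg_pa(0, 1, 0, 0x10)` addresses register 0x10 of the device in slot 0 under the recommended wiring.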
Compatibility Note – The UltraSPARC-IIi PCI bus is hardwired to Bus Number == 0.
A type 1 configuration cycle is generated when the bus number field of the configuration space address is not zero (that is, not the UltraSPARC-IIi Bus Number). The type 1 configuration cycle address is constructed from the configuration space address as shown in FIGURE 19-3.

FIGURE 19-3 Type 1 Configuration Address Mapping

Configuration Space Address
Bits   Field
32:24  000000001
23:16  Bus Number
15:11  Device Number
10:8   Function Number
7:2    Register Number
1:0    00

PCI Configuration Cycle Address
Bits   Field
31:24  Reserved
23:16  Bus Number
15:11  Device Number
10:8   Function Number
7:2    Register Number
1:0    00
Note – APB looks at type 0 and type 1 configuration cycle addresses, and either
routes type 1 transactions to one of the secondary busses, or to its own configuration space. See the APB User’s Manual for details.
Compatibility Note – UltraSPARC-IIi aliases Functions 1-7 of its PCI Configuration space to its Function 0 PCI Configuration space. (Bus 0, Device 0). The PCI specification requires that zeros be returned and stores ignored. Since this address space is only accessible to UltraSPARC-IIi PIO instructions, specifically boot PROM code, this aliasing should not be problematic because the boot PROM should never reference the UltraSPARC-IIi Function 1-7 addresses (see “Type 0 Configuration Address Mapping” on page 325 for the address decode scheme).
19.4.1.2
PCI I/O Space
PCI I/O cycles are generated by UltraSPARC-IIi in response to PIO reads and writes to addresses in one of the PCI I/O Spaces (one for each bus). For each access to I/O space, an I/O Read or I/O Write command is issued on the appropriate PCI bus. Bits 31:24 of the address on the PCI bus will be 0, and bits 23:0 will be a copy of physical address bits 23:0.
Note – It is expected that all PCI resources will be mapped by software into PCI
Memory space, and not PCI I/O space. UltraSPARC-IIi does provide a larger I/O space than did prior PCI-based UltraSPARC systems, so that devices that do use I/O space can be mapped to separate 8K pages for easier driver maintenance.
19.4.1.3
PCI Memory Space
PCI Memory cycles are generated by UltraSPARC-IIi in response to PIO reads and writes to addresses in one of the PCI Memory Spaces. As a bus master, UltraSPARC-IIi will never generate Dual-Address-Cycles; all PCI addresses generated will be bits [31:0] of the 41 bit UltraSPARC-IIi physical address. The memory command used for the PCI transaction depends on the PIO transaction type, as shown in TABLE 19-41. For PCI transactions with multiple data phases, UltraSPARC-IIi will always use Linear Incrementing mode as defined by the PCI specification. Cache Line Toggle Mode is not used.
Compatibility Note – Unlike prior PCI-based UltraSPARC systems, UltraSPARC-IIi does not use bit 31 of the PCI address for outgoing memory transactions, or bit 17 for outgoing IO transactions. APB also similarly preserves bits 31 and 17.
19.4.2
PCI Address Space—DMA
19.4.2.1
PCI Configuration Space
UltraSPARC-IIi does not respond to any Configuration Read or Configuration Write cycles. UltraSPARC-IIi/APB is the central resource for each PCI bus, and is expected to be the only device generating configuration cycles.
UltraSPARC-IIi PIO accesses to target configuration registers within the PBM are serviced without generating a configuration cycle on the PCI bus. Peer-to-peer transfers between two PCI devices on the same bus using Configuration Read or Configuration Write commands cannot be prohibited by UltraSPARC-IIi or APB, but are not expected to occur, since UltraSPARC-IIi/APB are the only devices that can drive the IDSEL# lines correctly.
19.4.2.2
PCI I/O Space
UltraSPARC-IIi does not respond to I/O Read or I/O Write commands on the PCI bus. Peer-to-peer transfers between two PCI devices on the same bus using I/O Read or I/O Write commands cannot be prohibited by UltraSPARC-IIi, but they are not expected to occur, since all PCI resources are intended to be mapped into Memory Space.
19.4.2.3
PCI Memory Space
DMA, DMA (IOMMU bypass), and PCI peer-to-peer activity occurs in PCI Memory Space. The final destination and address translation of a PCI Memory transaction is based on these functions:
• Addressing mode used: 64-bit (DAC) vs. 32-bit (SAC)
• Whether the PCI address[31:29] is enabled as UltraSPARC-IIi address space, by the PCI Target Address Space Register.
• Value of MMU_EN in the IOMMU Control Register
• Value of PCI address bits in DAC mode
TABLE 19-42 shows the various ways that UltraSPARC-IIi deals with PCI addresses as a PCI target device.
TABLE 19-42 PCI DMA Modes of Operation

Mode  Target Space Hit  MMU_EN  Addr           Result
SAC   no                X       N/A            PCI peer-to-peer (Ignored by UltraSPARC-IIi)
SAC   yes               0       N/A            Pass-through
SAC   yes               1       N/A            IOMMU Translation (DMA)
DAC   X                 X       0x0000-0x3FFE  Ignored by UltraSPARC-IIi
DAC   X                 X       0x3FFF         Bypass (DMA)
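The decision logic of TABLE 19-42 can be expressed as a small function. This is a sketch with hypothetical names; `addr_field` stands for the value of the "Addr" column in the table (the upper PCI address bits examined in DAC mode), which the caller is assumed to have extracted already.

```c
#include <assert.h>

enum pci_dma_mode {
    PCI_DMA_PEER_TO_PEER,   /* ignored by UltraSPARC-IIi */
    PCI_DMA_PASS_THROUGH,
    PCI_DMA_IOMMU_TRANSLATION,
    PCI_DMA_IGNORED,        /* DAC with Addr 0x0000-0x3FFE */
    PCI_DMA_BYPASS
};

/* is_dac:           nonzero for 64-bit (DAC) addressing, 0 for SAC.
 * target_space_hit: nonzero if PCI address[31:29] is enabled in the
 *                   PCI Target Address Space Register (SAC only).
 * mmu_en:           MMU_EN bit of the IOMMU Control Register.
 * addr_field:       "Addr" column value, only meaningful in DAC mode. */
static inline enum pci_dma_mode pci_dma_mode(int is_dac, int target_space_hit,
                                             int mmu_en, unsigned addr_field)
{
    if (is_dac) {
        assert(addr_field <= 0x3FFFu);
        return addr_field == 0x3FFFu ? PCI_DMA_BYPASS : PCI_DMA_IGNORED;
    }
    if (!target_space_hit)
        return PCI_DMA_PEER_TO_PEER;
    return mmu_en ? PCI_DMA_IOMMU_TRANSLATION : PCI_DMA_PASS_THROUGH;
}
```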
Pass-through
In pass-through mode, physical addr[40:32] = 0x000, physical addr[31:0] = PCI_Addr[31:0]. Pass-through transfers always generate cacheable transactions.
Compatibility Note – Unlike prior PCI-based UltraSPARC systems, Pass-through
does not zero PCI_Addr[31]
IOMMU Translation mode
In IOMMU translation mode, the physical address is obtained by performing a virtual to physical translation through the IOMMU. The value of the C bit in the TTE for the virtual page determines whether the transaction generated is cacheable or non-cacheable.
PCI peer-to-peer mode
In peer-to-peer mode, two devices on the same PCI bus transfer data without any involvement from UltraSPARC-IIi. There is no address translation involved – the master device simply puts out the PCI address to which the target device has been mapped. If no device has been mapped there, the PCI master device terminates its cycle with a Master-Abort.
Bypass mode
In bypass mode, the physical address = PCI_Addr. Whether a cacheable or non-cacheable transaction is made is determined by the value of PCI_Addr; a 0 in this bit specifies a cacheable transaction.
Compatibility Note – Prior PCI-based UltraSPARC systems used PCI_Addr,
but note that [40:34] are all 1’s for UPA64S addresses.
19.4.2.4
Memory Burst Order
In all cases, UltraSPARC-IIi only supports bursts as a target device in Linear Incrementing mode. If any of the reserved burst orders are used, UltraSPARC-IIi will issue a target disconnect after the first data phase.
19.4.3
DMA Error Registers
TABLE 19-43 DMA Error Registers

Register        PA                                  Access Size
DMA UE AFSR     0x1FE.0000.0030                     8 bytes
DMA CE AFSR     0x1FE.0000.0040                     8 bytes
DMA UE/CE AFAR  0x1FE.0000.0038 or 0x1FE.0000.0048  8 bytes
19.4.3.1
DMA UE Asynchronous Fault Status/Address Register
UltraSPARC-IIi IO logs any uncorrectable ECC error that it detects in the DMA UE AFSR/AFAR. Uncorrectable errors can result from DMA read or DMA partial writes when memory does not Read-Modify-Write because it does not see an entire 16-bytes of write data. IOMMU errors can result from any DMA operation.
This register contains primary error status bits and secondary error status bits. Only one of the primary error status bits can be set at any time. Primary error status can only be set when
• None of the primary error conditions exists prior to this error or
• A new error is detected at the same time as software is clearing the primary error; “at the same time” means on coincident clock cycles. Setting takes precedence over clearing.
Secondary bits are set whenever a primary bit is set. The secondary bits are cumulative and always indicate that information has been lost because no address information has been captured. Setting of the primary error bits is independent.
Compatibility Note – A PCI DMA UE interrupt is generated whenever a primary
DMA UE or Translation Error bit is set, even if by a CSR write. Ensure that software clears the AFSR before clearing the interrupt state and re-enabling the PCI Error Interrupt. (This behavior is similar to that of the ECU AFSR).
TABLE 19-44 DMA UE AFSR

Field      Bits   Description                                            POR state  Type
Reserved   63     Read as 0                                              0          R0
P_DRD      62     Set if primary DMA UE or TE is caused by PCI read      0          R/W1C
P_DWR      61     Set if primary DMA UE or TE is caused by PCI write     0          R/W1C
Reserved   60     Reserved, read as 0                                    0          R0
S_DRD      59     Set if secondary DMA UE or TE is caused by PCI read    0          R/W1C
S_DWR      58     Set if secondary DMA UE or TE is caused by PCI write   0          R/W1C
S_DTE      57     Set if secondary error is PCI DMA Translation Error    0          R/W1C
P_DTE      56     Set if primary error is PCI DMA Translation Error      0          R/W1C
Reserved   55:48  Read as 0                                              0          R0
BYTEMASK   47:32  0x00FF or 0xFF00, depending on [29] == 0 or 1          00FF       R
DW_OFFSET  31:29  DMA UE/CE AFAR bits [5:3]                              0          R
Reserved   28:24  Read as 0                                              0          R0
BLK        23     Set if primary error is caused by PCI read             0          R
Reserved   22:0   Reserved, read as 0                                    0          R0
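As an illustration of how software might interpret this register, the following Python sketch splits a 64-bit DMA UE AFSR value into the fields listed in TABLE 19-44 and builds a write-back value for the primary error bits. The function names are hypothetical, and the sketch assumes that the R/W1C type denotes write-one-to-clear. It only manipulates integers and performs no hardware access.

```python
# Field layout from TABLE 19-44: (name, high bit, low bit).
DMA_UE_AFSR_FIELDS = [
    ("P_DRD", 62, 62),
    ("P_DWR", 61, 61),
    ("S_DRD", 59, 59),
    ("S_DWR", 58, 58),
    ("S_DTE", 57, 57),
    ("P_DTE", 56, 56),
    ("BYTEMASK", 47, 32),
    ("DW_OFFSET", 31, 29),
    ("BLK", 23, 23),
]

# Bit positions of the primary error status bits.
PRIMARY_BITS = {"P_DRD": 62, "P_DWR": 61, "P_DTE": 56}


def decode_dma_ue_afsr(value):
    """Split a 64-bit register value into a dict of named fields."""
    fields = {}
    for name, hi, lo in DMA_UE_AFSR_FIELDS:
        width = hi - lo + 1
        fields[name] = (value >> lo) & ((1 << width) - 1)
    return fields


def primary_clear_mask(fields):
    """Value to write back to clear whichever primary bits are set.

    Assumption: R/W1C means that writing 1 to a bit clears it.
    """
    mask = 0
    for name, bit in PRIMARY_BITS.items():
        if fields[name]:
            mask |= 1 << bit
    return mask


# Example: primary translation error on a PCI read, upper doubleword.
sample = (1 << 62) | (1 << 56) | (0xFF00 << 32) | (0b101 << 29)
decoded = decode_dma_ue_afsr(sample)
```

Keeping the layout in a single table makes it straightforward to write a matching decoder for the DMA CE AFSR, which differs mainly in carrying E_SYND in bits 55:48.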
The AFAR and part of the AFSR log the address and status of the primary DMA UE or IOMMU error. A new DMA UE error is not logged into these bits until software clears the primary error to make the AFAR and part of the AFSR available to log the new error.
UltraSPARC-IIi extension to DMA UE AFSR operation
To facilitate debug, errors due to invalid TTE entries in the IOMMU TSB or write protection errors are also logged in the DMA UE AFSR and AFAR. See the shaded entries in AFSR TABLE 19-44.
Compatibility Note – This feature is absent in prior PCI-based UltraSPARC systems but should be compatible with existing Solaris code. The DWR, DRD bits, and a new bit, DTE, are set for this new case. Software should also get an error report from the DMA master that receives the Target Abort. This action provides the advantage of getting the VA of the error in the DMA UE AFAR. Since this error indicates a software problem with the IOMMU TSB, software should be able to sort out the two possible error indications. Note that the STA bit in the PCI Configuration Space Status register is also set, since UltraSPARC-IIi generated a Target Abort.
19.4.3.2
DMA UE/CE Asynchronous Fault Address Register
The AFAR and part of the AFSR log the address and status of the primary DMA UE or IOMMU error, and of the primary DMA CE. After logging an address associated with a primary DMA UE, a further DMA UE error is not logged until software clears the DMA UE AFSR primary UE or IOMMU error bits, to make the AFAR and part of the AFSR available to log a new error.

This AFAR is also used for primary DMA CE address logging. Further DMA CEs are not logged into these bits until software clears the primary error to make the AFAR and part of the AFSR available to log a new error. DMA UE or IOMMU errors, however, can always overwrite a value saved by a DMA CE primary error.

The PA of the TTE entry is saved on Invalid, Protection (IOMMU miss), and TTE UE errors. If the Protection error had an IOMMU hit, the translated PA from the IOMMU is saved instead. This may occur if a prior DMA read caused the IOMMU entry to be installed.
TABLE 19-45 DMA UE/CE AFAR

Field      Bits   Description                             POR state  Type
Reserved   63:41  Reserved, read as 0                     0          R0
UE/CE_PA   40:0   Physical address of error transaction   0          R
0          2:0    Always 0                                0          R0
19.4.3.3
DMA CE Asynchronous Fault Status/Address Register
UltraSPARC-IIi logs correctable ECC errors in the DMA CE AFSR/AFAR. Correctable errors can occur during DMA read or DMA partial write operations.

This register contains primary error status bits and secondary error status bits. Only one of the primary error status bits can be set at any time. Primary error status can be set only when
• None of the primary error conditions exists prior to this error, or
• A new error is detected at the same time as software is clearing the primary error; “at the same time” means on coincident clock cycles. Setting takes precedence over clearing.

Secondary bits are set whenever a primary bit is set. The secondary bits are cumulative and always indicate that information has been lost because no address information has been captured. Setting of the primary error bits is independent.
Compatibility Note – A DMA CE interrupt is generated whenever a primary DMA CE bit is set, even if by a CSR write. Ensure that software clears the AFSR before it clears the interrupt state and re-enables the PCI Error Interrupt. (This behavior is similar to that of the ECU AFSR).
TABLE 19-46 DMA CE AFSR

Field      Bits   Description                                       POR state  Type
Reserved   63     Reserved, read as 0                               0          R0
P_DRD      62     Set if primary DMA CE is caused by PCI read       0          R/W1C
P_DWR      61     Set if primary DMA CE is caused by PCI write      0          R/W1C
Reserved   60     Reserved, read as 0                               0          R0
S_DRD      59     Set if secondary DMA CE is caused by PCI read     0          R/W1C
S_DWR      58     Set if secondary DMA CE is caused by PCI write    0          R/W1C
Reserved   57:56  Reserved, read as 0                               0          R0
E_SYND     55:48  DMA CE Syndrome bits, logged on primary error     0          R
BYTEMASK   47:32  0x00FF or 0xFF00, depending on [29] == 0 or 1     00FF       R
DW_OFFSET  31:29  DMA UE/CE AFAR bits [5:3]                         0          R
Reserved   28:24  Read as 0                                         0          R0
BLK        23     Set if primary error is caused by PCI read        0          R
Reserved   22:0   Reserved, read as 0                               0          R0
CHAPTER
20
SPARC-V9 Memory Models
20.1
Overview
SPARC-V9 defines the semantics of memory operations for three memory models. From strongest to weakest, they are Total Store Order (TSO), Partial Store Order (PSO), and Relaxed Memory Order (RMO). The differences in these models lie in the freedom an implementation is allowed in order to obtain higher performance during program execution. The purpose of the memory models is to specify any constraints placed on the ordering of memory operations in uniprocessor and shared-memory multiprocessor environments. UltraSPARC-IIi supports all three memory models.

Although a program written for a weaker memory model potentially benefits from higher execution rates, it may require explicit memory synchronization instructions to function correctly if data is shared. MEMBAR is a SPARC-V9 memory synchronization primitive that enables a programmer to control explicitly the ordering in a sequence of memory operations. Processor consistency is guaranteed in all memory models.

The current memory model is indicated in the PSTATE.MM field. It is unaffected by normal traps, but is set to TSO (PSTATE.MM=0) when the processor enters RED_state.

A memory location is identified by an 8-bit Address Space Identifier (ASI) and a 64-bit virtual address. The 8-bit ASI may be obtained from an ASI register or included in a memory access instruction. The ASI is used to distinguish between, and provide an attribute for, different 64-bit address spaces. For example, the ASI is used by the UltraSPARC-IIi MMU and memory access hardware to control virtual-to-physical address translations, access to implementation-dependent control and data registers, and access protection. Attempts by non-privileged software (PSTATE.PRIV=0) to access restricted ASIs (ASI<7>=0) cause a privileged_action trap.
Memory is logically divided into real memory (cached) and I/O memory (noncached, with and without side-effects) spaces. Real memory spaces can be accessed without side-effects. For example, a read from real memory space returns the information most recently written. In addition, an access to real memory space does not result in program-visible side-effects. In contrast, a read from I/O space may not return the most recently written information and may result in program-visible side-effects.
20.2
Supported Memory Models
The following sections contain brief descriptions of the three memory models supported by UltraSPARC-IIi. These definitions are for general illustration. Detailed definitions of these models can be found in The SPARC Architecture Manual, Version 9. The definitions in the following sections apply to system behavior as seen by the programmer. A description of MEMBAR can be found in Section 8.3.2, “Memory Synchronization: MEMBAR and FLUSH” on page 72.
Note – Stores to UltraSPARC-IIi Internal ASIs, block loads, and block stores are outside the memory model; that is, they need MEMBARs to control ordering. See Section 8.3.8, “Instruction Prefetch to Side-Effect Locations” on page 79 and Section 13.5.3, “Block Load and Store Instructions” on page 172.

Note – Atomic load-stores are treated as both a load and a store and can only be applied to cacheable address spaces.
20.2.1
TSO
UltraSPARC-IIi implements the following programmer-visible properties in Total Store Order (TSO) mode:
• Loads are processed in program order; that is, there is an implicit MEMBAR #LoadLoad between them.
• Loads may bypass earlier stores. Any such load that bypasses such earlier stores must check (snoop) the store buffer for the most recent store to that address. A MEMBAR #Lookaside is not needed between a store and a subsequent load at the same noncacheable address.
• A MEMBAR #StoreLoad must be used to prevent a load from bypassing a prior store, if Strong Sequential Order is desired.
• Stores are processed in program order.
• Stores cannot bypass earlier loads.
• Accesses with the E-bit set (that is, those having side-effects) are all strongly ordered with respect to each other.
• An E-cache update is delayed on a store hit until all outstanding stores reach global visibility. For example, a cacheable store following a noncacheable store is not globally visible until the noncacheable store has reached global visibility; there is an implicit MEMBAR #MemIssue between them.
20.2.2
PSO
UltraSPARC-IIi implements the following programmer-visible properties in Partial Store Order (PSO) mode:
• Loads are processed in program order; that is, there is an implicit MEMBAR #LoadLoad between them.
• Loads may bypass earlier stores. Any such load that bypasses such earlier stores must check (snoop) the store buffer for the most recent store to that address. For SPARC-V9 compatibility, a MEMBAR #Lookaside should be used between a store and a subsequent load to the same non-cacheable address.
• Stores cannot bypass earlier loads.
• Stores are not ordered with respect to each other. A MEMBAR must be used for stores if stronger ordering is desired. A MEMBAR #MemIssue is needed for ordering of cacheable after non-cacheable stores.
• Non-cacheable accesses with the E-bit set (that is, those having side-effects) are all strongly ordered with respect to each other, but not with non-E-bit accesses.
Note – The behavior of partial stores to noncacheable addresses (pages with the TTE.CP=0) is dependent on the system and I/O device implementation. UltraSPARC-IIi generates a P_NCWR_REQ operation with a byte mask corresponding to the rs2 mask of the partial store instruction. If the system interconnect or I/O device is unable to perform the write operation of the bytes specified by the byte mask, an error is not signaled back to the processor.
20.2.3
RMO
UltraSPARC-IIi implements the following programmer-visible properties in Relaxed Memory Order (RMO) mode:
• There is no implicit order between any two memory references, either cacheable or non-cacheable, except that non-cacheable accesses with the E-bit set (that is, those having side-effects) are all strongly ordered with respect to each other.
• A MEMBAR must be used between cacheable memory references if stronger order is desired. A MEMBAR #MemIssue is needed for ordering of cacheable after non-cacheable accesses. A MEMBAR #Lookaside should be used between a store and a subsequent load at the same noncacheable address.
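The implicit ordering rules listed for the three models can be summarized in a small lookup table. The Python sketch below is illustrative only: it covers ordinary cacheable loads and stores, ignores E-bit, noncacheable, block, and internal-ASI accesses, and its function name is hypothetical. The #LoadStore and #StoreStore names are the SPARC-V9 MEMBAR mask bits for the two access pairs not named explicitly in the lists above.

```python
# True means the model keeps the pair in program order without a MEMBAR.
# Keys are (earlier access, later access).
IMPLICIT_ORDER = {
    "TSO": {("load", "load"): True, ("load", "store"): True,
            ("store", "store"): True, ("store", "load"): False},
    "PSO": {("load", "load"): True, ("load", "store"): True,
            ("store", "store"): False, ("store", "load"): False},
    "RMO": {("load", "load"): False, ("load", "store"): False,
            ("store", "store"): False, ("store", "load"): False},
}

# SPARC-V9 MEMBAR ordering mask that enforces each pair.
MEMBAR_MASK = {
    ("load", "load"): "#LoadLoad",
    ("load", "store"): "#LoadStore",
    ("store", "store"): "#StoreStore",
    ("store", "load"): "#StoreLoad",
}


def membar_needed(model, earlier, later):
    """Return the MEMBAR mask needed to order the pair, or None if implicit."""
    pair = (earlier, later)
    if IMPLICIT_ORDER[model][pair]:
        return None
    return MEMBAR_MASK[pair]
```

For example, `membar_needed("TSO", "store", "load")` returns `"#StoreLoad"`, matching the TSO rule that a load may bypass a prior store unless that MEMBAR is issued.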
CHAPTER
21
Code Generation Guidelines
21.1
Hardware / Software Synergy
One of the goals set for UltraSPARC-IIi was for the processor to execute SPARC-V8 binaries efficiently, providing approximately three times the performance of existing machines running the same code. A significantly larger performance gain can be obtained if the code is re-compiled using a compiler specifically designed for UltraSPARC-IIi. Several features are provided on UltraSPARC-IIi that can only be taken advantage of by using modern compiler technology. This technology was not available previously, mainly because the hardware support was not sufficient to justify its development.
21.2
Instruction Stream Issues
21.2.1
UltraSPARC-IIi Front End
The front end of the processor consists of the Prefetch Unit, the I-cache, the next field RAM, the branch and set prediction logic, and the return address stack. The role of the front end is to supply as many valid instructions as possible to the grouping logic and eventually to the functional units (the ALUs, floating-point adder, branch unit, load/store pipe, etc.).
21.2.2
Instruction Alignment
21.2.2.1
I-cache Organization
The 16-Kbyte I-cache is organized as a 2-way set associative cache, with each set containing 256 eight-instruction lines (FIGURE 21-1). The 14 bits required to access any location in the I-cache are composed of the 13 least significant address bits (since the minimum page size is 8 Kbytes, these 13 bits are always part of the page offset and need not be translated) and one bit used to predict the associativity number (way) in which instructions reside.

Out of a line of 8 instructions, up to 4 instructions are sent to the instruction buffer, depending on the address. If the address points to one of the last three instructions in the line, only that instruction and the ones (0-2) until the end of the line are selected (for simplicity and timing considerations, hardware support for getting instructions from two adjacent lines was not included). Consequently, on average for random accesses, 3.25 instructions are fetched from the I-cache. For sequential accesses, the fetching rate (4 instructions per cycle) equals or exceeds the consuming rate of the pipeline (up to 4 instructions per cycle).
FIGURE 21-1 I-cache Organization (SET 0 and SET 1, each with 256 lines; each line holds 8 instructions, 32 bytes)
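The 3.25 average follows directly from the fetch rule: entry slots 0 through 4 deliver four instructions, while slots 5, 6, and 7 deliver three, two, and one. The Python sketch below reproduces that arithmetic; the function names are hypothetical and the model covers only the line-boundary limit, not misses or prediction effects.

```python
LINE_BYTES = 32          # one I-cache line
INSTR_BYTES = 4          # one SPARC instruction
INSTRS_PER_LINE = LINE_BYTES // INSTR_BYTES
MAX_FETCH = 4            # at most four instructions per access


def instructions_fetched(address):
    """Instructions delivered by one I-cache access starting at address."""
    slot = (address % LINE_BYTES) // INSTR_BYTES
    return min(MAX_FETCH, INSTRS_PER_LINE - slot)


def average_fetch():
    """Average over all eight equally likely entry slots in a line."""
    total = sum(instructions_fetched(slot * INSTR_BYTES)
                for slot in range(INSTRS_PER_LINE))
    return total / INSTRS_PER_LINE
```

Here `average_fetch()` evaluates (5 × 4 + 3 + 2 + 1) / 8, which is the 3.25 figure for random accesses.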
21.2.2.2
Branch Target Alignment
Given the restriction mentioned above regarding the number of instructions fetched from an I-cache access, it is desirable to align branch targets so that enough instructions are fetched to match the number of instructions issued in the first group of the branch target. For instance, if the compiler scheduler indicates that the target can only be grouped with one more instruction, the target should be placed
anywhere in the line except in the last slot, since only one instruction would be fetched in that case. If the target is accessed from more than one place, it should be aligned so that it accommodates the largest possible group. If accesses to the I-cache are expected to miss, it may be desirable to align targets on a 16-byte (even 32-byte) boundary so that 4 instructions are forwarded to the next stage. Such an alignment can at least assure that four (eight for 32-byte alignment) instructions can be processed between cache misses, assuming that the code does not branch out of the sequence of instructions (which is generally not the case for integer programs).
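A compiler or assembler pass could apply this rule by checking how many instructions the first fetch at a branch target delivers and padding when that is fewer than the target's first group. The following Python sketch is one hypothetical way to compute such padding; inserting NOP slots is only one option (reordering code is another), and the function names are assumptions.

```python
LINE_BYTES = 32
INSTR_BYTES = 4
MAX_FETCH = 4


def instructions_fetched(address):
    """Instructions delivered by one I-cache access starting at address."""
    slot = (address % LINE_BYTES) // INSTR_BYTES
    return min(MAX_FETCH, LINE_BYTES // INSTR_BYTES - slot)


def padding_for_target(address, group_size):
    """Instruction slots to insert before the target so that the first
    fetch delivers at least group_size instructions (1 to 4)."""
    if not 1 <= group_size <= MAX_FETCH:
        raise ValueError("group_size must be between 1 and 4")
    pad = 0
    # Slot 0 of a line always delivers four instructions, so this ends.
    while instructions_fetched(address + pad * INSTR_BYTES) < group_size:
        pad += 1
    return pad
```

A target in the last slot of a line whose first group holds two instructions needs one slot of padding, while the same target in the second-to-last slot needs none.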
21.2.2.3
Impact of the Delay Slot on Instruction Fetch
If the last instruction of a line is a branch, the next sequential line in the I-cache must be fetched even if the branch is predicted taken, since the delay slot must be sent to the grouping logic. This leads to inefficient fetches, since an entire E-cache access must be dedicated to fetching the missing delay slot. Take care not to place delayed CTIs (control transfer instructions) that are predicted taken at the end of a cache line.
21.2.2.4
Instruction Alignment for the Grouping Logic
UltraSPARC-IIi can execute up to four instructions per cycle. The first three instructions in a group occupy slots that in most cases are interchangeable with respect to resources. Only special cases of instructions that can only be executed in IEU1 followed by IEU0 candidates violate this interchangeability (described in Section 22.5, “Integer Execution Unit (IEU) Instructions” on page 362). The fourth slot can only be used for PC-based branches or for floating-point instructions. Consequently, in order to get the most performance out of UltraSPARC-IIi, the code should be organized so that either a floating-point operation (FPop) or a branch is aligned with the fourth slot.

For floating-point code, it should be relatively easy for the compiler to take advantage of the added execution bandwidth provided by the fourth slot. For integer code, aligning the branch so that it is issued fourth in a group must be balanced with other factors that may be more important, such as not placing a branch at the end of a cache line. Moreover, if dependency analysis shows that a group of four instructions could be issued, but the fourth instruction is not a branch or an FPop while one of the first three is a branch, then before switching the two instructions (assuming no data dependency), the compiler must evaluate the following trade-off:
• Moving the fourth instruction ahead of the branch (cross-block scheduling) and generating possible compensation code for the alternate path.
• Breaking the group and scheduling the ALU instruction with the next group. Notice that this may not lengthen the critical path (in terms of number of cycles executed) if the next group can accommodate this extra instruction without adding any new group.
21.2.2.5
Impact of Instruction Alignment on PDU
There is one branch prediction entry for every two instructions in the I-cache. Each entry, consisting of a two-bit field, indicates if the branch is predicted taken or not-taken (the state machine is described in Section 21.2.6). In addition to the branch prediction field, there is a next field associated with every four instructions. The next field contains the index of the line and the associativity number (or way) of the line that should be fetched next. For sequential code, the next field points to the next line in the I-cache. If a predicted taken branch is among the four instructions, the next field contains the index of the target of the branch.

The following cases represent situations when the prediction bits and/or the next field do not operate optimally:

1. When the target of a branch is word 1 or word 3 of an I-cache line (FIGURE 21-2) and the fourth instruction to be fetched (instruction 4 or 6, respectively) is a branch, the branch prediction bits from the wrong pair of instructions are used.

FIGURE 21-2 Odd Fetch to an I-cache Line (instruction slots 0 through 7 of one line, with the odd fetches marked)
2. If a group of four instructions (instructions 0-3 or instructions 4-7) contains two branches and can be entered at a different position than the beginning of the group (other than instruction 0 and 4 respectively), the next field will contain the update from the latest branch taken in this group of four instructions, which may not be the one associated with the branch of interest (FIGURE 21-3).

FIGURE 21-3 Next Field Aliasing Between Two Branches (labels: Entry Point, Branch, Entry Point, Branch, Next Field)
3. Since there is one set of prediction bits for every two instructions, it is possible to have two branches (a CTI couple) sharing prediction bits. Under normal circumstances, the bits are maintained correctly; however, the bits may be updated based on the wrong branch if the second branch in the CTI couple is the target of another branch (FIGURE 21-4).

FIGURE 21-4 Aliasing of Prediction Bits in a Rare CTI Couple Case (labels: Entry Point, Branch, Branch Prediction)
As stated in Chapter 22, “Grouping Rules and Stalls,” if the addresses of the instructions in a group cross a 32-byte boundary, an implicit branch is “forced” between instructions at address 31 and 32 (low order bits). That rule has a performance impact only if a branch is in that specific group. Care should be taken not to place a branch in a group that crosses this boundary. FIGURE 21-5 shows an example of this rule. A group containing instructions I0 (branch), I1, I2, and I3 will be broken, because an artificial branch is forced after address 31 and there is already a branch in the group.
FIGURE 21-5 Artificial Branch Inserted after a 32-byte Boundary (instruction sequence labeled I3 at ..30, Branch at ..31, I1 at ..0, I2 at ..1, I3 at ..2, with the forced group break marked)
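A scheduler can test for these alignment hazards with simple address arithmetic. The Python sketch below is illustrative: it works on byte addresses of 4-byte instructions, the function names are hypothetical, and it assumes that prediction entries cover aligned instruction pairs and that next fields cover aligned groups of four (instructions 0-3 and 4-7 of a line).

```python
INSTR_BYTES = 4
LINE_BYTES = 32


def crosses_32_byte_boundary(first_address, count):
    """True if a group of count instructions spans two 32-byte lines."""
    last_address = first_address + INSTR_BYTES * (count - 1)
    return first_address // LINE_BYTES != last_address // LINE_BYTES


def shares_prediction_bits(address_a, address_b):
    """True if both instructions fall in the same aligned pair."""
    return address_a // (2 * INSTR_BYTES) == address_b // (2 * INSTR_BYTES)


def shares_next_field(address_a, address_b):
    """True if both instructions fall in the same aligned group of four."""
    return address_a // (4 * INSTR_BYTES) == address_b // (4 * INSTR_BYTES)
```

A group of four instructions starting at offset 0x18 crosses the boundary, so placing a branch in it would trigger the forced group break described above.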
21.2.3
I-cache Timing
If accesses to the I-cache hit, the pipeline will rarely starve for instructions. Only in pathological cases will the PDU be unable to provide a sufficient number of instructions to keep the functional units busy. For example, a taken branch to a taken branch sequence without any instructions between the branches (except for the delay slot) could only be executed at a peak rate of two instructions per cycle. Otherwise, up to 4 instructions are sent to the D Stage to be decoded and eventually dispatched in the G Stage and executed starting in the E Stage. An I-cache miss does not necessarily result in bubbles being inserted into the pipeline. Part of the I-cache miss processing, or even all of it, can be overlapped with the execution of instructions that are already in the instruction buffer and are waiting to be grouped and executed. Moreover, since the operation of the PDU is somewhat separated from the rest of the pipeline, the I-cache miss may have occurred when the pipeline was already stalled (for example, due to a multi-cycle integer divide, floating-point divide dependency, dependency on load data that missed the D-cache, etc.). This means that the miss (or part of it) may be transparent to the pipeline.
When an I-cache miss is detected, normal instruction fetching is disabled and a request is sent to the E-cache for the line that is missing in the I-cache. A full line of eight instructions (32 bytes) is brought into the processor in two parts (the interface to the E-cache is 16 bytes wide). The critical part (that is, the 16 bytes containing the instruction that caused the miss) is brought in first. If a predicted taken branch is in the second 16-byte block brought into the I-cache, there will be a one cycle delay before the next fetch (this is the time needed to compute the next address).

Because the processor can stall when the pipeline is waiting for new instructions, it is desirable to try to make routines fit in the I-cache and avoid hot spots (collisions). UltraSPARC-IIi provides instrumentation to profile a program and detect if instruction accesses generate a cache miss or a cache hit. For example, one can program performance counters to monitor I-cache accesses and I-cache misses. Then, by checkpointing the counters before and after a large section of code, combined with profiling the section of code, one can determine if the frequently executed functions generally hit or miss the I-cache. Instrumentation can be used in a similar manner to determine if a trap handler generally resides in the I-cache or causes a cache miss.
21.2.4
Executing Code Out of the E-cache
When frequently executed routines do not fit in the I-cache, it is possible to organize the code so that the main routines reside in the much larger E-cache and do not significantly affect the execution time. As an example we look at fpppp. Of the fourteen floating-point programs in SPECfp92, fpppp shows the highest I-cache miss rate (about 21% per cache access, or about 6.0% per instruction). For comparison, the next highest is doduc, with about a 3% miss rate per cache access, or 1% per instruction. Even though the I-cache miss rate is significant, UltraSPARC-IIi is barely affected by it (the impact on CPI is only 0.0084). It performs so well for the following reasons:
• The code is organized as a large sequential block.
• Branches are predicted very well (over 90%).
• The instruction buffer almost always contains several instructions when an I-cache miss occurs (an average of about 6.6).
• The instruction buffer is filled faster (up to 4 instructions per cycle) than it is emptied.
All these factors contribute to reducing the apparent I-cache miss latency to 0.14 cycles on average for fpppp; that is, on average, the pipeline is stalled for 0.14 cycles when an I-cache miss occurs.

The effectiveness of the instruction buffer and the prefetcher on fpppp demonstrates that techniques (such as loop unrolling) that create large sequential blocks of code can be used efficiently on UltraSPARC-IIi, even if these blocks do not fit in the I-cache. On the other hand, for code properly scheduled to take advantage of the four issue slots on UltraSPARC-IIi, the rate of instruction “consumption” may easily exceed the rate of instruction fetching, thus making I-cache misses more apparent.
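The CPI figure quoted for fpppp can be checked from the other two numbers: the added cycles per instruction are the misses per instruction multiplied by the apparent stall per miss. The short Python sketch below performs that multiplication; the function name is hypothetical.

```python
def icache_cpi_impact(misses_per_instruction, stall_cycles_per_miss):
    """Cycles per instruction added by I-cache misses."""
    return misses_per_instruction * stall_cycles_per_miss


# fpppp: about 6.0% misses per instruction, 0.14 stall cycles per miss.
fpppp_impact = icache_cpi_impact(0.06, 0.14)
```

With the fpppp figures, 0.06 × 0.14 gives approximately 0.0084, which agrees with the stated CPI impact.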
21.2.5
uTLB and iTLB Misses
The one-entry uTLB contains the virtual page number and the associated physical page number of the line accessed last. If the line currently accessed is to the same page, the instructions from that line are simply forwarded to the next stage. If the line is from a different virtual page, the translation is obtained from the iTLB a cycle later. The cost of crossing a page boundary is thus one cycle (the smallest possible page size, 8 Kbytes, is assumed). This may or may not translate into a one cycle penalty for the whole processor. For a tight loop with code spanning over two pages, this cost may be significant, especially if the instruction buffer is empty at the time of the page crossing. For this reason, it is desirable to position short loops within a page (avoid page crossing).

An iTLB miss is handled by software through the use of the TSB, and takes about 32 cycles. Consequently, an iTLB miss may be very costly in terms of idle processor cycles. In order to minimize the frequency of iTLB misses, UltraSPARC-IIi provides a large number of entries (64) in the iTLB and allows pages as large as 4 Mbytes to be used. Nonetheless, techniques that allocate pages based on profiling are encouraged to further decrease the iTLB miss cost.
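A code placement tool could flag short loops that straddle a page with a check like the one below. This Python sketch is illustrative; the function name is hypothetical, addresses are virtual byte addresses, and the default page size is the 8-Kbyte minimum assumed in the text.

```python
PAGE_BYTES_MIN = 8 * 1024            # smallest page size
PAGE_BYTES_MAX = 4 * 1024 * 1024     # largest page size


def loop_crosses_page(first_address, last_address, page_bytes=PAGE_BYTES_MIN):
    """True if the loop's first and last instructions are on different pages."""
    return first_address // page_bytes != last_address // page_bytes
```

A loop occupying 0x1FF0 through 0x2010 crosses an 8-Kbyte page boundary and pays the uTLB cost on every iteration; moving it to start at 0x2000 avoids the crossing.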
21.2.6
Branch Prediction
UltraSPARC-IIi predicts the outcome of branches and fetches the next instructions likely to be executed based on that outcome. While this is all done dynamically in hardware, the compiler has an impact on the initialization of the state machine. The static bit provided by BPcc and FBPfcc instructions is used to set the state machine in either the likely taken state or the likely not taken state (FIGURE 21-6). For branches without prediction (Bicc, FBfcc), UltraSPARC-IIi initializes the state machine to likely not taken.

Notice that a branch initialized to likely taken does not produce a correct next field for the immediately following I-cache fetch, since it takes one extra cycle to generate the correct address (branch offset added to the PC). This results in two lost cycles for fetching instructions, which does not necessarily lead to a pipeline stall. This penalty is much less than the mispredicted branch penalty (four cycles) that would occur if the branch prediction bit were always ignored and a static prediction were used (for example, always taken).

The state machine representing the algorithm used for branch prediction is shown in FIGURE 21-6. Note that this figure is identical to that shown in FIGURE A-14 on page 392.
FIGURE 21-6 Dynamic Branch Prediction State Diagram
States: ST (Strongly Taken), LT (Likely Taken), LNT (Likely Not Taken), SNT (Strongly Not Taken).
Transition labels: PT (Predicted Taken), PNT (Predicted Not Taken), AT (Actual Taken), ANT (Actual Not Taken).
For loops in steady state, the algorithm is designed so that it requires two mispredictions in order for the prediction to be changed from taken to not taken. Each loop exit will thus cause a single misprediction (versus two for a one-bit dynamic scheme).
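The difference between the two schemes can be seen by simulating a loop branch. The Python sketch below is a simplified model, not the UltraSPARC-IIi state machine itself: it assumes a plain two-bit saturating counter over the four states named in FIGURE 21-6, whose exact transitions may differ, and all function names are hypothetical.

```python
# Four predictor states, ordered from strongest not-taken to strongest taken.
SNT, LNT, LT, ST = range(4)


def predict_taken(state):
    """LT and ST predict taken; SNT and LNT predict not taken."""
    return state >= LT


def next_state(state, taken):
    """Assumption: saturating counter, one step per branch outcome."""
    return min(state + 1, ST) if taken else max(state - 1, SNT)


def two_bit_mispredictions(outcomes, state):
    misses = 0
    for taken in outcomes:
        if predict_taken(state) != taken:
            misses += 1
        state = next_state(state, taken)
    return misses


def one_bit_mispredictions(outcomes, predicted_taken):
    """One-bit scheme: always predict the previous outcome."""
    misses = 0
    for taken in outcomes:
        if predicted_taken != taken:
            misses += 1
        predicted_taken = taken
    return misses


def loop_branch_outcomes(iterations, executions):
    """Backward loop branch: taken on every iteration except the exit."""
    return ([True] * (iterations - 1) + [False]) * executions


outcomes = loop_branch_outcomes(iterations=10, executions=5)
```

Starting from a trained predictor, the two-bit model mispredicts only at each loop exit, whereas the one-bit model also mispredicts the first iteration of every re-entry.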
21.2.6.1
Impact of the Annulled Slot
Grouping rules in Chapter 22, “Grouping Rules and Stalls,” describe how UltraSPARC-IIi handles instructions following an annulling branch. In connection with these instructions, observe the following rules:
• Avoid scheduling multicycle instructions in the delay slot (for example, IMUL, IDIV, etc.).
• Avoid scheduling long latency instructions such as FDIV if the branch is predicted to be not-taken for a significant portion of the time (since they affect the timing of the non-taken stream).
• Avoid scheduling an instruction that would stall dispatching owing to a load-use dependency.
• Avoid scheduling WR(PR, ASR), SAVE, SAVED, RESTORE, RESTORED, RETURN, RETRY, and DONE in the delay slot and in the first three groups following an annulling branch.
21.2.6.2
Conditional Moves vs. Conditional Branches
The MOVcc and MOVR instructions provide an alternative to conditional branches for executing short code segments. UltraSPARC-IIi differentiates the two as follows:
• Conditional branches: the branches are always resolved in the C stage. Distancing the SETcc from Bicc does not gain any performance. The penalty for a mispredicted branch is always four cycles. SETcc, Bicc, and the delay slot can be grouped together (FIGURE 21-7).

setcc   G  E  C  N1 N2 N3 W
bicc    G  E  C  N1 N2 N3 W
delay   G  E  C  N1 N2 N3 W

FIGURE 21-7 Handling of Conditional Branches
• Conditional moves: MOVcc and MOVR are dispatched as single instruction groups. Consequently, SETcc and MOVcc (or MOVR) cannot be grouped together (vs. SETcc and Bicc). Also, a use of the destination register for the MOVcc follows the same rule as a load-use (breaks group plus a bubble). FIGURE 21-8 shows a typical example.

setcc   G  E  C  N1 N2 N3 W
movcc      G  E  C  N1 N2 N3 W
use              G  E  C  N1 N2 N3 W

FIGURE 21-8 Handling of MOVcc
The use of FMOVR is more constrained than MOVcc. Besides having to wait for the load buffer to be empty, FMOVR and any younger IEU instructions must be separated by one group, even if there is no dependency between the IEU instruction and FMOVR.

Assuming that a specific branch can only be predicted with 50% accuracy (basically, it is not predicted), the compiler must balance the two-cycle penalty on average for the mispredicted branch case against the ability to schedule other instructions around MOVcc (the SETcc cycle and the two groups after MOVcc, since MOVcc is a single instruction group). The need for multiple MOVcc instructions to guard multiple operations also must be taken into account.
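The two-cycle average is the four-cycle misprediction penalty weighted by the 50% chance of mispredicting. The Python sketch below generalizes that calculation. The comparison helper and its `movcc_overhead_cycles` parameter are hypothetical: the overhead of the conditional-move sequence depends on how well the surrounding code can be scheduled and has to be estimated by the compiler.

```python
MISPREDICT_PENALTY_CYCLES = 4  # penalty for a mispredicted branch


def expected_mispredict_penalty(accuracy):
    """Average cycles lost per branch for a given prediction accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * MISPREDICT_PENALTY_CYCLES


def prefer_conditional_move(accuracy, movcc_overhead_cycles):
    """True if the estimated MOVcc overhead is below the branch penalty."""
    return movcc_overhead_cycles < expected_mispredict_penalty(accuracy)
```

With 50% accuracy the expected penalty is 2 cycles, so a conditional-move sequence that costs one extra cycle would be preferred under this model; for a well-predicted branch the same sequence would not be.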
21.2.7
I-cache Utilization
Grouping blocks that are executed frequently can effectively increase the apparent size of the I-cache. Cache studies show that, often, half of the I-cache entries are never executed. Placing rarely executed code out of a line containing a frequently executed block (identified by profiling) achieves a better I-cache utilization.
21.2.8
Handling of CTI couples
UltraSPARC-IIi handles CTI couples by taking a “false” trap on the second CTI. It processes the first CTI, executes instructions until the second CTI reaches the N3 stage, squashes all instructions executed after the first CTI, and executes instructions starting with the second CTI. Nine cycles are lost when CTI couples are encountered, which should discourage their use.
21.2.9
Mispredicted Branches
The dynamic branch prediction mechanism used for UltraSPARC-IIi can generally achieve a success rate of 87% for integer programs and around 93% for floating-point programs (SPEC92). Correctly predicted conditional branches allow the processor to group instructions from adjacent basic blocks and continue progress speculatively until the branch is resolved. The capability of executing instructions speculatively is a significant performance boost for UltraSPARC-IIi. On the other hand, when a branch is mispredicted, up to 18 instructions can be cancelled. This is the case when two instructions from the current group are cancelled along with 4 groups of 4 instructions, as shown in FIGURE 21-9. This case is costly but, fortunately, very rare.
bicc              F  D  G  E  C  N1 N2 N3 W
delay             F  D  G  E  C  N1 N2 N3 W
instr1            F  D  G  E  C  N1 N2 N3 W
instr2            F  D  G  E  C  N1 N2 N3 W
grp1                 F  D  G  E  C  N1 N2 N3 W
grp2                    F  D  G  E  C  N1 N2 N3 W
grp3                       F  D  G  E  C  N1 N2 N3 W
grp4                          F  D  G  E  C  N1 N2 N3 W
instr1 (correct)                 F  D  G  E  C  N1 N2 N3 W
...

FIGURE 21-9 Cost of a Mispredicted Branch (Shaded Area)
FIGURE 21-9 shows how expensive badly behaved branches are for UltraSPARC-IIi.
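As a rough illustration of why predictability matters, the following Python sketch estimates the average number of cycles lost per executed branch. It assumes the four-bubble misprediction cost given in Section 22.6.1 and ignores the value of the cancelled work; the function name is hypothetical and is not part of any Sun tool.

```python
# Illustrative model only: average cycles lost per branch to misprediction.
MISPREDICT_BUBBLES = 4  # bubbles inserted by a mispredicted control transfer (Section 22.6.1)


def expected_branch_penalty(accuracy, bubbles=MISPREDICT_BUBBLES):
    """Return the average number of cycles lost per executed branch."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * bubbles


if __name__ == "__main__":
    cases = (("unpredicted (50%)", 0.50), ("integer (87%)", 0.87), ("floating-point (93%)", 0.93))
    for label, accuracy in cases:
        print(f"{label}: {expected_branch_penalty(accuracy):.2f} cycles per branch")
```

Under these assumptions, a branch predicted with only 50% accuracy costs two cycles on average, which is the figure used in the MOVcc discussion earlier in this chapter.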
Special effort should be made to predict branches that follow highly predictable branches based on profiling, and to combine conditions to make branches more predictable. Finally, if two or more branches are found to be correlated, it may be advantageous to duplicate common blocks to obtain separate branch predictions for hard-to-predict branches. For example, in FIGURE 21-10, if the outcome of branch A, which is executed before branch B, has an impact on the direction of branch B, then it is preferable to split the code and duplicate the branch.
348
UltraSPARC-IIi User’s Manual • October 1997
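FIGURE 21-10 Branch Transformation to Reduce Mispredicted Branches
(Original code: branch A leads to block 1 or block 2, both followed by a shared block 3 ending in branch B, which is hard to predict. Transformed code: block 3 is duplicated after block 1 and after block 2, ending in branch B and branch C respectively, each of which is predictable.)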
The technique, shown in FIGURE 21-10, can be generalized to N levels, where N branches are correlated and become more predictable. The above technique may lead to unrolling of loops that were previously identified as bad candidates because of the unpredictable behavior of their conditional branches.
21.2.10
Return Address Stack (RAS)
In order to speed up returns from subroutines invoked through CALL instructions, UltraSPARC-IIi dedicates a 4-deep stack to store the return address. Each time a CALL is detected, the return address is pushed onto this RAS (Return Address Stack). Each time a return is encountered, the address is obtained from the top of the stack and the stack is popped. UltraSPARC-IIi considers a return to be a JMPL or RETURN with rs1 equal to %o7 (leaf subroutine) or %i7 (normal subroutine). The RAS provides a guess for the target address, so that prefetching can continue even though the address calculation has not yet been performed. JMPL or RETURN instructions using rs1 values other than %o7 or %i7, and DONE or RETRY instructions, also use the value on the top of the RAS for continuing prefetching, but they do not pop the stack. See Section 17.1, “Overview” on page 261 for information about the contents of the RAS during RED_state processing.
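The push and pop behaviour described above can be sketched as a small Python model. The overflow and empty-stack policies are assumptions made only for illustration (this manual does not state them), and the class and method names are hypothetical.

```python
# Illustrative model of a 4-deep return address stack (RAS).
# ASSUMPTIONS (not stated in this manual): a CALL that finds the stack full
# discards the oldest entry, and an empty stack yields no prediction (None).


class ReturnAddressStack:
    DEPTH = 4  # depth stated in the text above

    def __init__(self):
        self._entries = []

    def on_call(self, return_address):
        """CALL detected: push the return address."""
        if len(self._entries) == self.DEPTH:
            self._entries.pop(0)  # assumed overflow policy: drop the oldest entry
        self._entries.append(return_address)

    def on_return(self):
        """JMPL or RETURN with rs1 = %o7 or %i7: predict from the top and pop."""
        return self._entries.pop() if self._entries else None

    def on_other_jump(self):
        """Other JMPL/RETURN, DONE, or RETRY: use the top entry without popping."""
        return self._entries[-1] if self._entries else None


if __name__ == "__main__":
    ras = ReturnAddressStack()
    for address in (0x1008, 0x2008, 0x3008, 0x4008, 0x5008):  # five nested calls
        ras.on_call(address)
    print([ras.on_return() for _ in range(5)])
```

Under the assumed overflow policy, the fifth nested return finds the stack empty, which suggests that call chains deeper than four levels benefit less from the RAS.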
21.3
Data Stream Issues
21.3.1
D-cache Organization
The D-cache is a 16-Kbyte, direct-mapped, virtually indexed, physically tagged (VIPT), write-through, non-allocating cache. It is logically organized as 512 lines of 32 bytes. Each line contains two 16-byte sub-blocks (FIGURE 21-11).
FIGURE 21-11 Logical Organization of D-cache
(512 lines; each line holds sub-block 0 and sub-block 1, 16 bytes each)
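The organization above implies a simple mapping from a virtual address to a D-cache line and sub-block. The following Python sketch derives that mapping from the stated sizes only; the function name is hypothetical, and the model ignores the physical tag.

```python
# Illustrative decomposition of a virtual address for the D-cache geometry
# given in the text: 16 Kbytes, direct-mapped, 32-byte lines, 16-byte sub-blocks.
DCACHE_BYTES = 16 * 1024
LINE_BYTES = 32
SUBBLOCK_BYTES = 16
NUM_LINES = DCACHE_BYTES // LINE_BYTES  # 512 lines


def dcache_location(vaddr):
    """Return (line index, sub-block number, byte offset within the sub-block)."""
    line = (vaddr // LINE_BYTES) % NUM_LINES
    subblock = (vaddr % LINE_BYTES) // SUBBLOCK_BYTES
    offset = vaddr % SUBBLOCK_BYTES
    return line, subblock, offset


if __name__ == "__main__":
    for vaddr in (0x0000, 0x0010, 0x3FF8, 0x4000):
        print(hex(vaddr), dcache_location(vaddr))
```

Addresses that are 16 Kbytes apart, such as 0x0000 and 0x4000, map to the same line index and therefore compete for the same cache line.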
21.3.2
D-cache Timing
The latency of a load to the D-cache depends on the opcode. For unsigned loads, data can be used two cycles after the load. For instance, if the first two instructions in the instruction buffer are a load and an instruction dependent on that load, the grouping logic will break the group after the load and a bubble will be inserted in the pipeline the following cycle. Code compiled for an earlier SPARC processor with a load-use penalty of one cycle will show a penalty of about one CPI just for this rule; thus, it is very important to separate loads from their use.
21.3.2.1
Signed Loads
All signed loads smaller than 64 bits must be separated from their use by three cycles; otherwise, an extra bubble is inserted in the pipeline to force the separation between the load and its use. Floating-point loads are not sign extended, so they have a latency of two cycles. Once a signed load (smaller than 64 bits) is encountered in the instruction stream, all subsequent consecutive loads (signed or unsigned) also return data in three cycles; otherwise, there would be a collision between two loads returning data. As soon as a cycle without a load appears in the pipeline, the latency of loads is brought back to two cycles.
Note – The SPARC-V8 LD instruction is replaced with LDUW in SPARC-V9; the
new instruction does not require sign extension.
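The latency rules for signed and unsigned loads can be summarized in a short Python model. It is a sketch only: it assumes at most one load per cycle, represents a cycle without a load as None, and treats LDSB, LDSH, and LDSW as the signed loads smaller than 64 bits. The function name is hypothetical.

```python
# Illustrative model of D-cache load latency (D-cache hits only).
SIGNED_SUB64_LOADS = {"LDSB", "LDSH", "LDSW"}  # signed loads smaller than 64 bits


def load_latencies(cycles):
    """cycles: one entry per pipeline cycle, a load mnemonic or None (no load).

    Returns a list of (mnemonic, latency in cycles) for each load, in order.
    """
    result = []
    extended = False  # True while a run of consecutive loads contains a signed load
    for op in cycles:
        if op is None:
            extended = False  # a cycle without a load restores two-cycle latency
            continue
        if op in SIGNED_SUB64_LOADS:
            extended = True
        result.append((op, 3 if extended else 2))
    return result


if __name__ == "__main__":
    print(load_latencies(["LDUW", "LDSH", "LDUW", None, "LDUW"]))
```

In this model the unsigned load that directly follows LDSH also takes three cycles, while the load issued after the empty cycle returns to two.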
21.3.3
Data Alignment
SPARC-V9 requires that all accesses be aligned on an address boundary equal to the size of the access. Otherwise a mem_address_not_aligned trap is generated. This is especially important for double-precision floating-point loads, which should be aligned on an 8-byte boundary. If misalignment is determined to be possible at compile time, it is better to use two LDF (load floating-point, single precision) instructions and avoid the trap. UltraSPARC-IIi supports single-precision loads mixed with double-precision operations, so that the case above can execute without penalty (except for the additional load). If a trap does occur, UltraSPARC-IIi dedicates a trap vector for this specific misalignment, which reduces the overall penalty of the trap.
Grouping load data is desirable, since a D-cache sub-block can contain either four properly aligned single-precision operands or two properly aligned double-precision operands (eight and four respectively for a D-cache line). As we shall see later, this is desirable not only for improving the D-cache hit rate (by increasing its utilization density), but also for D-cache misses where, for sequential accesses, one out of two requests to the E-cache can be eliminated. Grouping load data beyond a D-cache sub-block is also desirable, since an E-cache line contains four D-cache sub-blocks (for a total of 64 bytes). Thus, sequential accesses can guarantee that only one E-cache miss will occur for loads that access up to four consecutive D-cache sub-blocks (two D-cache lines). Section 21.3.6 discusses how code scheduled for accessing data directly out of the E-cache can hide the extra latency introduced by D-cache misses.
Data alignment (right justification) for byte, halfword, and word accesses does not add latency to the loads unless superseded by the sign rule described in Section 21.3.2.1, “Signed Loads”. This is true whether the load goes to the register file or to internal pipeline bypasses.
21.3.4
Direct-Mapped Cache Considerations
A direct-mapped cache is more susceptible to collisions than a set-associative cache. It is possible to organize data at compile time so that collisions are minimized, however. For frequently executed loops, the compiler should organize the data so that all accesses within the loop are mapped to different cache lines, unless the access is to a line that is already mapped and the access is to the same physical line. For UltraSPARC-IIi, this means that accesses should differ in the virtual address bits VA. Hot spots can be detected by configuring the on-chip counters to accumulate D-cache accesses and D-cache misses. The counters can be turned on/off before/after the load of interest, or around a series of loads where hot spots are suspected to occur.
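A compiler or analysis tool could check a loop's data layout for potential collisions along the following lines. This Python sketch uses only the D-cache geometry from Section 21.3.1 (512 lines of 32 bytes); it compares virtual addresses and ignores the physical tag, and the function names are hypothetical.

```python
# Illustrative check for direct-mapped D-cache collisions among loop accesses.
from collections import defaultdict

LINE_BYTES = 32
NUM_LINES = 512


def dcache_line(vaddr):
    """Return the D-cache line index selected by a virtual address."""
    return (vaddr // LINE_BYTES) % NUM_LINES


def find_conflicts(vaddrs):
    """Map each contended line index to the distinct memory lines competing for it."""
    by_index = defaultdict(set)
    for vaddr in vaddrs:
        by_index[dcache_line(vaddr)].add(vaddr // LINE_BYTES)
    return {
        index: sorted(line * LINE_BYTES for line in lines)
        for index, lines in by_index.items()
        if len(lines) > 1
    }


if __name__ == "__main__":
    # Two arrays whose bases are 16 Kbytes apart collide on line index 0.
    print(find_conflicts([0x10000, 0x14000, 0x10020]))
```

Accesses that fall in the same memory line are not reported, which matches the exception noted above for accesses to a line that is already mapped.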
21.3.5
D-cache Miss, E-cache Hit Timing
Under normal circumstances (for example, no snoops, no arbitration conflict for the E-cache bus), loads that hit the E-cache are returned N cycles later than loads that hit the D-cache, where N is determined by the E-cache SRAM mode. TABLE 21-1 shows the latency for all supported SRAM Modes. (See Section 1.3.3.1, “E-Cache SRAM Modes” on page 6 for more information.)
TABLE 21-1 D-cache Miss, E-cache Hit Latency Depends on SRAM Mode
SRAM Mode       No. of Cycles
2–2–2           9
2–2             7
If such a load (D-cache miss, E-cache hit) is immediately followed by a use, the group is broken and an (N+1)-cycle stall occurs; PIPELINE EXAMPLE 21-1 illustrates this situation. (The figure shows an 8-cycle stall, which is consistent with 2–2 mode; 2–2–2 mode incurs a 10-cycle stall.)
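The stall lengths quoted above follow directly from TABLE 21-1. A minimal Python sketch of the arithmetic (the dictionary and function names are illustrative only):

```python
# N from TABLE 21-1: extra cycles for a D-cache miss that hits the E-cache.
ECACHE_HIT_EXTRA_CYCLES = {"2-2-2": 9, "2-2": 7}


def load_use_stall(sram_mode):
    """Stall, in cycles, when the use immediately follows the load: N + 1."""
    return ECACHE_HIT_EXTRA_CYCLES[sram_mode] + 1


if __name__ == "__main__":
    for mode in ECACHE_HIT_EXTRA_CYCLES:
        print(mode, load_use_stall(mode))
```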
PIPELINE EXAMPLE 21-1 D-cache Miss, E-cache Hit (2–2 mode shown)
load r1    F  D  G  E  C  N1 Q  Q  Q  Q  Q  Q
use r1     F  D  G  G  E  E  E  E  E  E  E  E  E  C  N1 N2 N3 W
(Figure annotations: Group Break; (N+1)-Cycle Stall; Execution Resumes)
Because of the high penalty associated with a load miss for code scheduled based on loads hitting the D-cache, UltraSPARC-IIi provides hardware support for non-blocking loads through a load buffer that allows code scheduling based on External Cache (E-cache) hits.
21.3.6
Scheduling for the E-cache
Some applications have a working set that is too large to fit within the D-cache (they cause many capacity misses); others use data in patterns that generate many conflict misses. Compilers can schedule these applications to “bypass” the D-cache and access the data out of the E-cache. Loads that miss the D-cache do not necessarily stall the pipeline (non-blocking loads). Instead, they are sent to the load buffer, where they wait for the data to be returned from the E-cache. The pipeline stalls only when an instruction that is dependent on the non-blocking load enters the pipeline before the load data is returned.
21.3.6.1
Mixing D-cache Misses and D-cache Hits
The UltraSPARC-IIi “golden rule” is that all load data are returned in order. For instance if a load misses the D-cache, enters the load buffer, and is followed by a load that hits the D-cache, the data for the second (younger) load is not accessible. In this case, the younger load also must enter the load buffer; it will access the D-cache array only after the older load (D-cache miss) does so. If the load buffer is not empty, the D-cache array access is decoupled from the D-cache tag access; that is, it is performed some cycles after the tag access.
Note – Accessing blocked data in the D-cache while there is a load in the load buffer and scheduling the code so that operations can be performed on the blocked load data is not supported on UltraSPARC-IIi. Data is always returned and operated upon in order.
PIPELINE EXAMPLE 21-2 on page 354 clarifies what is not supported without stalls on UltraSPARC-IIi.
PIPELINE EXAMPLE 21-2 Load Hit Bypassing Load Miss (Not Supported on UltraSPARC-IIi)
ld   [%l1+%g0],%l6     (D-cache miss)
ld   [%l2+%g0],%l7     (D-cache hit)
add  %l7,%g1,%g2       (use of D-cache hit)
add  %l6,%g1,%g3       (use of D-cache miss)
In PIPELINE EXAMPLE 21-2, the first ADD will stall the pipeline until both the load miss and the load hit are handled. If the ADDs are interchanged, the first ADD can proceed as soon as the load miss is handled. As a rule, if load latencies are expected to be a problem, the compiler should always schedule the use of loads in the same order that the loads appear in the program. While blocking part of an array in the D-cache and operating on the data during a previous D-cache miss may help reduce register pressure (three extra registers could be made available for an inner loop), the added complexity needed to handle conflicts in accessing the D-cache array offsets the potential benefits (for example, adding a port to the D-cache vs. adding a bubble on collisions).
21.3.6.2
Loads to the Same D-cache Sub-block
When a load enters the load buffer, the memory location loaded is compared to all other (older) loads in the buffer. If the other loads are to the same 16-byte sub-block, the entering load is marked as a hit, since by the time it accesses the D-cache array, the sub-block will be present (PIPELINE EXAMPLE 21-3). The detection of a hit eliminates a transaction to the E-cache, which results in making more slots available for other clients of the E-cache bus (I-cache, store buffer, snoops). Thus, it helps to organize the code so that data is accessed sequentially. This may involve interchanging loops so that array subscripts are incremented by one between each load access.
PIPELINE EXAMPLE 21-3 Interleaved D-cache Hits and Misses to Same Sub-block
.align start 16 bytes
ld   [start],%f0         (D-cache miss)
ld   [start + 8],%f2     (D-cache hit)
ld   [start + 16],%f4    (D-cache miss)
ld   [start + 24],%f6    (D-cache hit)
UltraSPARC-IIi can access the E-cache only every other cycle. This still provides an average of 8 bytes per cycle, but only in 16-byte chunks.
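The sub-block comparison performed when a load enters the load buffer can be modelled as follows. This Python sketch assumes that every load misses the D-cache tags initially and that all loads shown are in the load buffer at the same time (the buffer depth is not modelled); the function names are hypothetical.

```python
# Illustrative model of hit marking for loads to the same 16-byte sub-block.
SUBBLOCK_BYTES = 16


def classify_loads(addresses):
    """Classify loads, in program order, as 'miss' or 'hit' in the load buffer."""
    pending = set()  # sub-blocks already requested by older loads
    result = []
    for address in addresses:
        subblock = address // SUBBLOCK_BYTES
        if subblock in pending:
            result.append((address, "hit"))  # sub-block will be present by access time
        else:
            result.append((address, "miss"))
            pending.add(subblock)
    return result


def ecache_transactions(addresses):
    """Number of E-cache requests generated: one per load classified as a miss."""
    return sum(1 for _, kind in classify_loads(addresses) if kind == "miss")


if __name__ == "__main__":
    start = 0x2000  # 16-byte aligned, as in PIPELINE EXAMPLE 21-3
    print(classify_loads([start, start + 8, start + 16, start + 24]))
```

For the four sequential loads of PIPELINE EXAMPLE 21-3, the model issues two E-cache requests instead of four, which is the saving described above.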
21.3.6.3
Mixing Independent Loads and Stores
Note – The bus turnaround penalty is two cycles for systems running in 2–2–2 mode only; systems running in 2–2 mode incur no turnaround penalty.
Mixing reads and writes from and to the E-cache results in a penalty, caused by the difference in timing between reads and writes and also the bus turnaround time. UltraSPARC-IIi automatically tends to separate loads and stores through the use of the load buffer and store buffer. The loads are given access to the E-cache, even if older stores have been waiting to access it. Only when the number of stores passes the “high-water mark” (5 stores) does the store buffer have priority. The code can be organized to further minimize the number of bus turnaround cycles. CODE EXAMPLE 21-1 shows how loads and stores can be grouped so that only one turnaround penalty occurs (for a given state of the load buffer and store buffer). This can be accomplished with the help of a memory reference analyzer (Section 21.3.9, “Non-Faulting Loads” covers this in more detail).
CODE EXAMPLE 21-1 Avoiding Bus Turnaround Penalties (1–1–1 mode only)
2 Penalties:
ld   [addr1],%l1
st   %l2,[addr2]
ld   [addr3],%l3
st   %l4,[addr4]
1 Penalty:
ld   [addr1],%l1
ld   [addr3],%l3
st   %l2,[addr2]
st   %l4,[addr4]
21.3.6.4
Using LDDF to Load Two Single-Precision Operands/Cycle
UltraSPARC-IIi supports single cycle 8-byte data transfers into the floating-point register file for LDDF. Wherever possible, applications that use single-precision floating-point arithmetic heavily should organize their code and data to replace two LDFs with one LDDF. This reduces the load frequency by approximately one half, and cuts execution time considerably.
21.3.7
Store Buffer Considerations
The store buffer on UltraSPARC-IIi is designed so that stores can be issued even when the data is not ready. More specifically, a store can be issued in the same group as the instruction producing the result. The address of a store is buffered until the data is eventually available. Once in the store buffer, the store data is buffered until it can be sent “quietly” (that is, without interfering with other instructions) to the D-cache, the E-cache, I/O devices, or the frame buffer (for noncacheable stores).
To increase the throughput to the E-cache, which results in decreasing the frequency of the store buffer full condition, UltraSPARC-IIi collapses two stores to the same 16 bytes of memory into one store. Since compression only occurs among two adjacent entries in the store buffer, the code should be organized so that multiple stores to the same “region” in memory are issued sequentially (increasing or decreasing order).
21.3.8
Read-After-Write and Write-After-Read Hazards
A Read-After-Write (RAW) hazard occurs when a load to the same address as an older outstanding store is issued. UltraSPARC-IIi does not provide direct bypassing from intermediate stages of the store buffer to the various pipes, which may result in pipeline stalls. Most RAW hazards can be eliminated by proper register allocation and by eliminating spurious loads. Disassembled traces of various programs showed that most RAWs were “false” RAWs, and can be eliminated. However, some RAWs were “true” RAWs; they occur because two data structures point to the same memory location (through array indexes or pointers) without having knowledge that there could be a match between them. In order to simplify the hardware, the full 40 physical address bits are not used when comparing the address of the memory location requested by the load with the addresses associated with the stores in the store buffer. The rules are:
• The physical tag of the address is ignored.
• If the load hits the D-cache, bits of the address are used for comparison (byte granularity).
• If the load misses the D-cache, bits of the address are used for comparison (sub-block granularity).
In order to cover both cache hits and cache misses, one should try to avoid RAWs based on a 16-byte boundary (using bits ). Even if a RAW occurs, the pipeline is not stalled until a use of the load data enters the pipeline (similar to the way loads are handled during D-cache misses). CODE EXAMPLE 21-2 shows an example of back-to-back instructions causing a RAW hazard and a load-use. In the best scenario (that is, when the store buffer and load buffer are empty) the RAW hazard stalls the pipe for 8 cycles (versus one cycle for the normal load-use stall). This is mainly due to the fact that the store data enters the store buffer late in the pipe and that the load buffer must wait until the data is in the D-cache before it can access it.
CODE EXAMPLE 21-2 RAW Hazard Penalty
st   %l1,[addr1]
ld   [addr1],%l2       (RAW Hazard)
add  %l2,%l3,%l4
Under the Relaxed Memory Order (RMO) mode, stores can pass younger loads if a MEMBAR instruction has not been issued to prevent it. UltraSPARC-IIi provides hardware detection of Write-After-Read (WAR) hazards so that a store to the same memory address as an older outstanding load does not pass that load. If a WAR hazard is detected, the store waits in the store buffer until the older load completes. The CPI penalties resulting from this only have a second-order effect on performance. The store buffer may fill up (rare), or an extra RAW hazard could be generated because stores stay in the store buffer longer.
21.3.9
Non-Faulting Loads
The ability to move instructions “up” in the instruction stream beyond conditional branches can effectively hide the latencies of long operations. This also increases the number of candidate instructions that the compiler can schedule without conflicts. SPARC-V9 provides non-faulting loads (equivalent to silent loads used for Multiflow TRACE and Cydrome Cydra-5), so that loads can be moved ahead of conditional control structures that guard their use. Non-faulting loads execute as any other loads, except that catastrophic errors, such as segmentation fault conditions, do not cause the program to terminate. The hardware and software (trap handler) cooperate so that the load appears to complete normally with a zero result. In order to minimize page faults when a speculative load references a NULL pointer (address zero), system software should map low addresses (especially address zero) to a page of all zeros and use the Non-Faulting Only (NFO) page attribute bit. Simulations of general code percolation for UltraSPARC-IIi have shown that there is much to be gained by using non-faulting loads. For integer programs the average group size (AGS) sent down the pipeline is 33% larger when code motion is allowed across one branch (using speculative loads) and 50% larger when instructions can be moved ahead of two branches.
CHAPTER 22
Grouping Rules and Stalls
22.1
Introduction
This chapter explains in detail how to group instructions to obtain maximum throughput in UltraSPARC-IIi. The following subsections explain the formatting conventions that make it easier to understand this information.
22.1.1
Textual Conventions
Rules are presented that consider instructions in three different ways:
Instructions: Actual SPARC-V9 and UltraSPARC-IIi machine instructions are always written in Mixed Case BODY FONT. Examples are:
• FdMULq (Floating-point multiply double to quad, SPARC-V9)
• LDDF (Load Double Floating-Point Register, SPARC-V9)
• SHUTDOWN (Power Down Support, UltraSPARC-IIi)
Instruction Families: These are groups of related SPARC-V9 instructions, introduced (but not described) in The SPARC Architecture Manual, Version 9. Instruction families are always written in Mixed Case Bold Face Body Font. Examples are:
• BPcc (Branch on Integer Condition Codes with Prediction) consists of the following instructions: BPA, BPCC, BPCS, BPE, BPG, BPGE, BPGU, BPL, BPLE, BPLEU, BPN, BPNE, BPNEG, BPPOS, BPVC, and BPVS.
• FMOVcc (Move Floating-Point Register on Condition) consists of the following instructions: FMOV{s,d,q}A, FMOV{s,d,q}CC, FMOV{s,d,q}CS, FMOV{s,d,q}E, FMOV{s,d,q}G, FMOV{s,d,q}GE, FMOV{s,d,q}GU, FMOV{s,d,q}L, FMOV{s,d,q}LE, FMOV{s,d,q}LEU, FMOV{s,d,q}N, FMOV{s,d,q}NE, FMOV{s,d,q}NEG, FMOV{s,d,q}POS, FMOV{s,d,q}VC, and FMOV{s,d,q}VS.
Instruction Classes: These are groups of SPARC-V9 and UltraSPARC-IIi instructions that have similar effects. Instruction classes are always written in lower case italic body font. Examples are:
• setcc (any instruction that sets the condition codes)
• alu (any instruction processed in the Arithmetic and Logic Unit)
22.1.2
Example Conventions
Instructions are shown with offsets between their stages, to indicate the amount of latency that normally occurs between the instructions. The following instruction pair—PIPELINE EXAMPLE 22-1—has one cycle of latency:
PIPELINE EXAMPLE 22-1 Instruction with one cycle of latency
ADD i1, i2, i6          G  E  C  N1 N2 N3 W
SLL i6, 2, i8              G  E  C  N1 N2 N3 W
The instruction pair shown in PIPELINE EXAMPLE 22-2 has no latency:
PIPELINE EXAMPLE 22-2 Instruction with no latency
alu → r6                G  E  C  N1 N2 N3 W
store → r6              G  E  C  N1 N2 N3 W
22.2
General Grouping Rules
Up to four instructions can be dispatched in one cycle, subject to availability from the instruction buffer, execution resources, and instruction dependencies. UltraSPARC-IIi has input (read-after-write) and output (write-after-write) dependency constraints, but no anti-dependency (write-after-read) constraints on instruction grouping. Instructions belong to one or more of the following categories:
• Single group
• IEU
• Control transfer
• Load/store
• Floating-point/graphics
Note – CALL, RETURN, JMPL, BPr, PST and FCMP{LE,NE,GT,EQ}{16,32} belong to
multiple categories.
22.3
Instruction Availability
Instruction dispatch is limited to the number of instructions available in the instruction buffer. Several factors limit instruction availability. UltraSPARC-IIi fetches up to four instructions per clock from an aligned group of eight instructions. When the fetch address (modulo 32) is equal to 20, 24, or 28, then three, two, or one instruction(s) respectively are added to the instruction buffer. The next cache line and set are predicted using a next field and set predictor for each aligned four instructions in the instruction cache. When a set or next field mispredict occurs, instructions are not added to the instruction buffer for two clocks. When an I-cache miss occurs, instructions are added to the instruction buffer as data is returned from the E-cache.
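The fetch-alignment rule above reduces to a one-line calculation. The Python sketch below models only that rule (it ignores mispredicts, I-cache misses, and buffer occupancy), and the function name is hypothetical.

```python
# Illustrative model of how many instructions one fetch can add to the
# instruction buffer, based only on alignment within a 32-byte group.
INSTRUCTION_BYTES = 4
FETCH_GROUP_BYTES = 32  # aligned group of eight instructions
MAX_FETCH = 4           # up to four instructions per clock


def instructions_fetched(fetch_address):
    """Return the maximum number of instructions fetched from this address."""
    if fetch_address % INSTRUCTION_BYTES != 0:
        raise ValueError("instruction addresses are 4-byte aligned")
    left_in_group = (FETCH_GROUP_BYTES - fetch_address % FETCH_GROUP_BYTES) // INSTRUCTION_BYTES
    return min(MAX_FETCH, left_in_group)


if __name__ == "__main__":
    for offset in range(0, 32, 4):
        print(offset, instructions_fetched(0x1000 + offset))
```

Branch targets placed at offsets 0 through 16 within a 32-byte group therefore allow a full four-instruction fetch, which is one reason to align frequently executed targets.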
22.4
Single Group Instructions
Certain instructions are always dispatched by themselves to simplify the hardware. These instructions are: LDD(A), STD(A), block load instructions (LDDF{A} with an ASI of 70₁₆, 71₁₆, 78₁₆, 79₁₆, F0₁₆, F1₁₆, F8₁₆, F9₁₆), ADDC{cc}, SUBC{cc}, {F}MOVcc, {F}MOVr, SAVE, RESTORE, {U,S}MUL{cc}, MULX, MULScc, {U,S}DIV{X}, {U,S}DIVcc, LDSTUB{A}, SWAP{A}, CAS{X}A, LD{X}FSR, ST{X}FSR, SAVED, RESTORED, FLUSH{W}, ALIGNADDR, RETURN, DONE, RETRY, WR{PR}, RD{PR}, Tcc, SHUTDOWN, and the second control transfer instruction of a DCTI couple.
22.5
Integer Execution Unit (IEU) Instructions
IEU instructions can be dispatched only if they are in the first three instruction slots. A maximum of two IEU instructions can be executed in one cycle. There are two IEU pipelines: IEU0 and IEU1. The two data paths are slightly different, and some IEU instructions can be dispatched only to a particular pipeline. The following instructions can be dispatched to either IEU pipeline: ADD, AND, ANDN, OR, ORN, SUB, XOR, XNOR and SETHI. These instructions can be grouped together or with older IEU0 or IEU1 specific instructions. The IEU0 data path has dedicated hardware for shift instructions: SLL{X}, SRL{X}, SRA{X}. Two shift instructions cannot be grouped together. Shift instructions can be grouped with older IEU1 specific instructions, but they cannot be grouped with older non-specific IEU instructions. See PIPELINE EXAMPLE 22-3.
PIPELINE EXAMPLE 22-3 Showing allowable grouping of shift instructions
ADD i1, i2, i6          G  E  C  N1 N2 N3 W
SLL i6, 2, i8              G  E  C  N1 N2 N3 W
The IEU1 data path has dedicated hardware for the condition-code-setting instructions: (TADDcc{TV}, TSUBcc{TV}, ADDcc, ANDcc, ANDNcc, ORcc, ORNcc, SUBcc, XORcc, XNORcc), EDGE and ARRAY. CALL, JMPL, BPr, PST and FCMP{LE,NE,GT,EQ}{16,32} also require the IEU1 data path (besides counting as CTI, store, or floating-point instructions respectively), since they must access the integer register file. Two instructions requiring the use of IEU1 cannot be grouped together; for example, only one instruction that sets the condition codes can be dispatched per cycle. An IEU1 instruction can be grouped with older shift instructions and non-specific IEU instructions.
Note – For UltraSPARC-IIi, a valid control transfer instruction (CTI) that was
fetched from the end of a cache line is not dispatched until its delay slot also has been fetched.
22.5.1
Multi-Cycle IEU Instructions
Some integer instructions execute for several cycles and sometimes prevent the dispatch of subsequent instructions until they complete.
MULScc inserts one bubble after it is dispatched. SDIV{cc} inserts 36 bubbles, UDIV{cc} inserts 37 bubbles, and {U,S}DIVX inserts 68 bubbles after they are dispatched.
MULX and {U,S}MUL{cc} delay dispatching subsequent instructions for a variable number of clocks, depending on the value of the rs1 operand. Four bubbles are inserted when the upper 60 bits of rs1 are zero, or for signed multiplies when the upper 60 bits of rs1 are all ones. Otherwise, an additional bubble is inserted each time the upper 60 bits of rs1 are not all zeros (or all ones for signed multiplies) after arithmetic right shifting rs1 by two bits. This implies a maximum of 18 bubbles for SMUL{cc}, 19 bubbles for UMUL{cc}, and 34 bubbles for MULX.
WR{PR} inserts four bubbles after it is dispatched. RDPR from the CANSAVE, CANRESTORE, CLEANWIN, OTHERWIN, FPRS, and WSTATE registers, and RD from any register are not dispatchable until four clocks after the instruction reaches the first slot of the instruction buffer.
Writes to the TICK, PSTATE, and TL registers and FLUSH{W} instructions cause a pipeline flush when they reach the W Stage, effectively inserting nine bubbles.
22.5.2
IEU Dependencies
Instructions that have the same destination register (in the same register file) cannot be grouped together, unless the destination register is %g0. For example (PIPELINE EXAMPLE 22-4):
PIPELINE EXAMPLE 22-4 Instructions with the same destination cannot be grouped together
alu → i6                G  E  C  N1 N2 N3 W
load → i6                  G  E  C  N1 N2 N3 W
Instructions that reference the result of an IEU instruction cannot be grouped with that IEU instruction, unless the result is being stored in %g0. See PIPELINE EXAMPLE 22-5.
PIPELINE EXAMPLE 22-5 Instructions cannot be grouped with the IEU instruction whose result they reference, unless stored in %g0
alu → i6                G  E  C  N1 N2 N3 W
LDX [i6+i1], i8            G  E  C  N1 N2 N3 W
There are two exceptions to this rule: Integer stores can store the result of an IEU instruction other than FCMP{LE,NE,GT,EQ}{16,32} and be in the same group (PIPELINE EXAMPLE 22-6):
PIPELINE EXAMPLE 22-6 Exception to rule of PIPELINE EXAMPLE 22-5
alu → r6                G  E  C  N1 N2 N3 W
store → r6              G  E  C  N1 N2 N3 W
Also, BPicc or Bicc can be grouped with an older instruction that sets the condition codes, as in PIPELINE EXAMPLE 22-7.
PIPELINE EXAMPLE 22-7 Grouping BPicc or Bicc instructions
seticc                  G  E  C  N1 N2 N3 W        Group 1
BPicc                   G  E  C  N1 N2 N3 W        Group 1
Instructions that read the result of a MOVcc or MOVr cannot be in the same group or the following group; see PIPELINE EXAMPLE 22-8:
PIPELINE EXAMPLE 22-8 Grouping for instructions that read results of MOVcc or MOVr
MOVcc %xcc, 0, i6       G  E  C  N1 N2 N3 W
LDX [i6+i1], i8               G  E  C  N1 N2 N3 W
Instructions that read the result of an FCMP{LE,NE,GT,EQ}{16,32} (including stores) cannot be in the same group or in the two following groups. STD is treated as dependent on earlier FCMP instructions, regardless of the actual registers referenced (PIPELINE EXAMPLE 22-9).
PIPELINE EXAMPLE 22-9 Rule for instructions that read the result of an FCMP{LE,NE,GT,EQ}{16,32}
FCMPLE32 f2, f4, i6     G  E  C  N1 N2 N3 W
LDX [i6+i1], i8                  G  E  C  N1 N2 N3 W
In some cases, UltraSPARC-IIi prematurely dispatches an instruction that uses the result of an FCMP{LE,NE,GT,EQ}{16,32}; it then cancels the instruction in the W Stage and refetches it. This effectively inserts nine bubbles into the pipe. To avoid this, software should explicitly force the use instruction to be in the third group or later after the FCMP{LE,NE,GT,EQ}{16,32}.
MULX, {U,S}MUL{cc}, MULScc, {U,S}DIV{X}, {U,S}DIVcc, and STD cannot be in the two groups following an FCMP{LE,NE,GT,EQ}{16,32} (PIPELINE EXAMPLE 22-10).
PIPELINE EXAMPLE 22-10 MULX cannot be in the two groups following FCMP{LE,NE,GT,EQ}{16,32}
FCMPLE32 f2, f4, i6     G  E  C  N1 N2 N3 W
MUL i8,i7,i9                     G  E  C  N1 N2 N3 W
FMOVr cannot be in the same group or in the group following an IEU instruction, even if it does not reference the result of the IEU instruction. It cannot be in the same group (PIPELINE EXAMPLE 22-11) or the next two groups (PIPELINE EXAMPLE 22-12) following an FCMP{LE,NE,GT,EQ}{16,32}.
PIPELINE EXAMPLE 22-11 FMOVr must be at least two groups ahead of an IEU instruction
ADD i1, i2, i6          G  E  C  N1 N2 N3 W
FMOVr i5,i7                   G  E  C  N1 N2 N3 W
PIPELINE EXAMPLE 22-12 FMOVr cannot be in the next two groups following an FCMP{LE,NE,GT,EQ}{16,32}
FCMPLE16 → i6           G  E  C  N1 N2 N3 W
FMOVr i5                         G  E  C  N1 N2 N3 W
22.6
Control Transfer Instructions
One Control Transfer Instruction (CTI) can be dispatched per group. The following control transfer instructions are not single group instructions: CALL, BPcc, Bicc, FB(P)fcc, BPr, and JMPL. CALL and JMPL are always dispatched as the oldest instruction in the group; that is, a group break is forced before dispatching these instructions.
DONE, RETRY, and the second instruction of a delayed control transfer instruction (DCTI) couple flush the pipe when they reach the W Stage, effectively inserting nine bubbles into the pipe. The pipeline is flushed even if the second DCTI is annulled.
22.6.1
Control Transfer Dependencies
UltraSPARC-IIi can group instructions following a control transfer with the control transfer instruction. Instructions following the delay slot come from the predicted instruction stream. Examples for a branch predicted taken and a branch predicted not taken are shown in PIPELINE EXAMPLE 22-13 and PIPELINE EXAMPLE 22-14 respectively.
PIPELINE EXAMPLE 22-13 Branch predicted taken
setcc                   G  E  C  N1 N2 N3 W        Group 1
BPcc                    G  E  C  N1 N2 N3 W        Group 1
FADD (delay slot)       G  E  C  N1 N2 N3 W        Group 1
FMUL (branch target)    G  E  C  N1 N2 N3 W        Group 1
PIPELINE EXAMPLE 22-14 Branch predicted not taken
setcc                   G  E  C  N1 N2 N3 W        Group 1
BPcc                    G  E  C  N1 N2 N3 W        Group 1
FADD (delay slot)       G  E  C  N1 N2 N3 W        Group 1
FDIV (sequential)       G  E  C  N1 N2 N3 W        Group 1
If the delay slot of a DCTI is aligned on a 32-byte address boundary (that is, the DCTI is the last instruction in a cache line and the delay slot contains the first instruction in the next cache line), then the DCTI cannot be grouped with instructions from the predicted stream (PIPELINE EXAMPLE 22-15).
PIPELINE EXAMPLE 22-15 Case when DCTI cannot be grouped with instructions from the predicted stream
setcc                   G  E  C  N1 N2 N3 W        Group 1
BPcc                    G  E  C  N1 N2 N3 W        Group 1
FADD (32-byte aligned)  G  E  C  N1 N2 N3 W        Group 1
FMUL (branch target)       G  E  C  N1 N2 N3 W     Group 2
If the second instruction of the predicted stream is aligned on a 32-byte address boundary, then the DCTI cannot be grouped with that instruction (PIPELINE EXAMPLE 22-16).
PIPELINE EXAMPLE 22-16 Cannot group DCTI with second instruction of predicted stream if it is on a 32-byte boundary
BPcc                    G  E  C  N1 N2 N3 W        Group 1
ADD (delay slot)        G  E  C  N1 N2 N3 W        Group 1
FADD                    G  E  C  N1 N2 N3 W        Group 1
FMUL (32-byte aligned)     G  E  C  N1 N2 N3 W     Group 2
The delay slot of a DCTI cannot be grouped with instructions from the predicted stream of another DCTI following the delay slot; see PIPELINE EXAMPLE 22-17.
PIPELINE EXAMPLE 22-17 Cannot group DCTI delay slot with instructions from predicted stream of following DCTI
Group 1   FADD (delay slot 1)      G  E  C  N1 N2 N3 W
          BPcc                     G  E  C  N1 N2 N3 W
          ADD (delay slot 2)       G  E  C  N1 N2 N3 W
Group 2   FMUL (branch target)        G  E  C  N1 N2 N3 W
When a control transfer is mispredicted, the instruction buffer and instructions younger than the delay slot in the pipe are flushed, effectively inserting four bubbles in the pipe. An FDIV or FSQRT in the mispredicted stream causes dependent instructions in the correct branch stream to stall until the FDIV or FSQRT reaches the W1 Stage (see footnote 1). PIPELINE EXAMPLE 22-18 shows the case in which the branch in the previous example was predicted not taken but was actually taken.
PIPELINE EXAMPLE 22-18 Stall after mispredicted control transfer
Group 1   setcc                           G  E  C  N1 N2 N3 W
          BPcc (mispredicted)             G  E  C  N1 N2 N3 W
          FADD (delay slot)               G  E  C  N1 N2 N3 W
          FMUL → f0 (sequential)          G  E  C  N1 N2 N3 W  W1
Group 2   FMUL f0,f0,f0 (branch target)                           G  E
If an annulling branch is predicted not taken, the delay slot is still dispatched.
1. The W1 Stage is a virtual stage that is normally not visible to the programmer.
Multicycle instructions (except load instructions) run to completion, even if the delay slot instruction is annulled; see PIPELINE EXAMPLE 22-19.
PIPELINE EXAMPLE 22-19 Multicycle instructions complete when delay-slot instruction is annulled
BPcc,a (not taken)     G  E  C  N1 N2 N3 W
imul (delay slot)         G  E  E  E  E  E  E  ...
The imul unit is busy for the duration of the multiply. An annulled delay slot, other than a load, affects subsequent dependency checking until the delay slot reaches the W1 Stage; see PIPELINE EXAMPLE 22-20.
PIPELINE EXAMPLE 22-20 Annulled delay-slot affects subsequent dependency checking
Group 1   BPcc,a (not taken)           G  E  C  N1 N2 N3 W
          FDIV → f0 (delay slot)          G  E  C  N1 N2 N3 W  W1
Group 2   FADD f0,f0,f1 (sequential)                              G
In the example above, the FADD instruction is stalled in issue until the FDIV instruction completes. A predicted annulled load does not affect dependency checking after it is dispatched; see PIPELINE EXAMPLE 22-21.
PIPELINE EXAMPLE 22-21 Predicted annulled load does not affect dependency checking after dispatch
Group 1   BPcc,a (predicted not taken)   G  E  C  N1 N2 N3 W
          fld → f0 (delay slot)          G  E  C  N1 N2 N3 W
Group 2   FADD f0,f0,f1 (sequential)        G  E  C  N1 N2 N3 W
An annulled load use or floating-point use is treated as a dependent instruction until the N2 Stage of the branch; see PIPELINE EXAMPLE 22-22.
PIPELINE EXAMPLE 22-22 Use treated as a dependent instruction
Group 1   FADD f7,f7,f6        G  E  C  N1 N2 N3 W
          Bcc,a (not taken)    G  E  C  N1 N2 N3 W
          bubble(2)
Group 2   FADD f6,f7,f8                 G  flushed
Group 3   FADD f6,f7,f8                    G  E  C  N1 N2
If the annulling branch is grouped with a delay slot containing a load use, the group will pay the full load use penalty even if the load use is annulled. This is because the branch is not resolved until the use stall is released.
WR{PR}, SAVE, SAVED, RESTORE, RESTORED, RETURN, RETRY, and DONE are stalled in the G-stage until earlier annulling branches are resolved, even if they are not in the delay slot. This means that they cannot be dispatched in the same group or the first three groups following an annulling branch instruction; see PIPELINE EXAMPLE 22-23.
PIPELINE EXAMPLE 22-23 Some instructions cannot be dispatched within three groups of an annulling branch instruction
Bicc,a    G  E  C  N1 N2 N3 W
SAVE                  G  E  C  N1 N2
LDD{A}, LDSTUB{A}, SWAP{A} and CAS{X}A are stalled in the G-stage if there is a delayed control transfer instruction in the E Stage or C Stage; see PIPELINE EXAMPLE 22-24.
PIPELINE EXAMPLE 22-24 Instructions that stall for delayed control transfer instruction
Bicc         G  E  C  N1 N2 N3 W
Bubble(2)
LDD                   G  E  C  N1 N2
22.7
Load / Store Instructions
Load / store instructions can be dispatched only if they are in the first three instruction slots. One load/store instruction can be dispatched per group. Load / store instructions that are not single-group instructions are: LD{SB,SH,SW,UB,UH,UW,X}{A}, LD{D}F{A}, ST{B,H,W,X}{A}, STF{A}, STDF{A}, JMPL, MEMBAR, STBAR, PREFETCH{A}.
LDD{A}, STD{A}, LDSTUB{A}, SWAP{A} will not dispatch younger instructions for one clock after they are dispatched. CAS{X}A will not dispatch younger instructions for two clocks after they are dispatched.
Loads are not stalled on a cache miss; instead, they are enqueued in the load buffer until data can be returned. Load data is returned in the order that loads are issued, so a cache miss forces subsequent load hits to be enqueued until the older load miss data is available. Stores are not stalled on a cache miss. Stores are enqueued in the store buffer until data can be written to the E-cache SRAM for cacheable accesses, to PCI or UPA64S for noncacheable accesses, or to the internal register for internal ASIs. Store data is written in the order that stores are issued, so a cache miss forces subsequent store hits to remain enqueued until the older store miss data is written out.
22.7.1
Load Dependencies and Interaction with Cache Hierarchy
Instructions that reference the result of a load instruction cannot be grouped with the load instruction or in the following group unless the register is %g0; see PIPELINE EXAMPLE 22-25.
PIPELINE EXAMPLE 22-25 Grouping instructions that reference the result of a load instruction
LDDF  [r1], f6 (not enqueued)   G  E  C  N1 N2 N3 W
Bubble(1)
FMULd f4, f6, f8                      G  E  C  N1 N2 N3
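The load-use rule above amounts to a minimum separation of two groups between a load and the first use of its result. The following is a minimal sketch of that one rule, assuming groups are numbered consecutively as they are dispatched and ignoring every other stall condition in this chapter. The function name is hypothetical.

```python
def earliest_use_group(load_group, dest_reg):
    """Earliest group that may hold an instruction reading dest_reg.

    Models only the rule stated above: a use cannot share the load's group
    or the following group, unless the register is %g0.
    """
    if dest_reg == "%g0":
        return load_group      # the rule does not apply to %g0
    return load_group + 2      # one empty group in between, i.e. Bubble(1)

print(earliest_use_group(0, "f6"))
```

With the load in group 0, the dependent instruction lands in group 2, which corresponds to the single bubble shown in PIPELINE EXAMPLE 22-25.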
Single-precision floating-point loads lock the double register containing the single-precision rd for data dependency checking; see PIPELINE EXAMPLE 22-26.
PIPELINE EXAMPLE 22-26 Single-precision floating-point loads
LDF   [r1], f6 (not enqueued)   G  E  C  N1 N2 N3 W
Bubble(1)
FMULs f7, f7, f8                      G  E  C  N1 N2 N3
Instructions other than floating-point loads that have the same destination register as an outstanding load are treated the same as a source register dependency; see PIPELINE EXAMPLE 22-27.
PIPELINE EXAMPLE 22-27 Instructions other than floating-point loads
load  i6 (not enqueued)   G  E  C  N1 N2 N3 W
ADD   i2, i1, i6                G  E  C  N1 N2 N3
When an instruction referencing a load result enters the E Stage and the data is not yet returned, all instructions in the E Stage and earlier will be stalled. If there are multiple load uses, then all E-Stage and earlier instructions will be stalled until loads that have dependencies return data. E-Stage stalls can occur when referencing the result of a signed integer load, a load that misses the D-cache or a D-cache load hit whose data is delayed following one of the two previous cases.
22.7.1.1
Delayed Return Mode
Signed integer loads that hit the D-cache cause UltraSPARC-IIi to enter delayed return mode. In delayed return mode, an extra clock of delay is added to all returning load data. UltraSPARC-IIi remains in delayed return mode until some load other than a signed integer D-cache hit can return data in the normal time without colliding with a delayed return mode load.
22.7.1.2
Cache Timing
The following example illustrates D-cache hit timing. The first load causes UltraSPARC-IIi to enter delayed return mode, returning data in the N1 Stage. The second load is also in delayed return mode, returning data in its N1 Stage; otherwise it would collide with the first load data. The group containing the third load and the first ADD (which references the first load data) is stalled in the E Stage for one clock until both load uses by the first ADD have returned data. Since the third load is stalled in E, its normal C Stage data return will not collide with a previous delayed return mode load. This allows the last ADD to avoid an E Stage stall. If the third load were not grouped with the first ADD, it would not be stalled in the E Stage, and the last ADD would be dispatched one clock earlier. The third load causes the pipeline to exit delayed return mode.
PIPELINE EXAMPLE 22-28 Illustrating D-cache hit timing
Group 1   LDSB [i1], i6 (D-cache hit)   G  E  C  N1 N2 N3 W
Group 2   LDB  [i3], i7 (D-cache hit)      G  E  C  N1 N2 N3 W
          Bubble(1)
Group 3   LDB  [i7], i4 (D-cache hit)            G  E  E  C  N1 N2
          ADD  i6,i7,i8                          G  E  E  C  N1
Group 4   Stall
          Bubble(2)
Group 5   ADD  i4,i5,i9                                   G  E  C
22.7.1.3
Block Memory Accesses
Unlike other loads, block loads do not lock all of their destination registers. If there are two block loads outstanding, any instruction except a block store is held in the G-stage until the first block load leaves the load buffer. A block load leaves the load buffer when its first word of data has returned.
22.7.1.4
Read-After-Write and Interaction with Store Buffer
If a load hits the D-cache and overlaps a store in the store buffer, the load does not return data until two clocks after the store updates the D-cache. The overlap check is pessimistic, because only the lower 14 bits of the effective memory address are checked. If a store is issued one clock earlier than an overlapping load that hits the D-cache, the load data is returned seven clocks later than normal. If a load misses the D-cache and if bits 13..4 of the load’s effective memory address are the same as a store in the store buffer, the load data is not returned until six clocks after the store leaves the store buffer. If a store is issued one clock earlier than a D-cache miss load and bits 13..4 of the address are the same, the load data is returned six clocks later than a normal D-cache miss load.
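The pessimism of these checks comes from comparing only part of the address. As a minimal sketch of the D-cache miss case described above (comparison of bits 13..4 only), the code below shows how two unrelated addresses can still match. The function names and the sample addresses are hypothetical; the hardware's treatment of access size is not modeled.

```python
def bits_13_4(addr):
    """Extract bits 13..4 of an effective address (a 10-bit value)."""
    return (addr >> 4) & 0x3FF

def miss_load_matches_store(load_addr, store_addr):
    """True if a D-cache-miss load would be held behind this buffered store,
    according to the bits 13..4 comparison described above."""
    return bits_13_4(load_addr) == bits_13_4(store_addr)

# Addresses 16 KB apart refer to different memory but still match.
print(miss_load_matches_store(0x10040, 0x14040))
# Addresses in adjacent 16-byte blocks do not match.
print(miss_load_matches_store(0x10040, 0x10050))
```

Because bits above 13 are ignored, data structures whose hot stores and loads sit a multiple of 16 KB apart can incur the store buffer delay even though they never overlap.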
MEMBAR #StoreLoad or #MemIssue blocks younger loads from returning data until three clocks after no older stores are outstanding (see Section 22.7.2, “Store Dependencies” on page 373). In the best case, a load use is stalled in the E Stage until 15 clocks after the previous store is dispatched.
22.7.1.5
Other Timing Issues
LD{X}FSR blocks dispatch of younger floating-point / graphics instructions that reference floating-point registers, FB{P}fcc, MOVfcc, ST{X}FSR, and LD{X}FSR instructions until four clocks after the data is returned in delayed return mode, and five clocks after the load data is returned otherwise. For example, if there are no outstanding load misses from the D-cache:
PIPELINE EXAMPLE 22-29 LD{X}FSR blocks FP instruction issue
LDFSR (D-cache hit)   G  E  C  N1 N2 N3 W  W1 W2
FMULS f7,f7,f8                                   G
LDD{A} instructions are held in the G-stage until three clocks after the N3 Stage, or until older loads have returned data. If LDD{A} is dispatched and a miss occurs on an N2 Stage or earlier load, the instruction will be canceled in the W Stage and fetched again. It will then be held in the G-stage until three clocks after older loads have returned data.
FLUSH{W}, {F}MOVr, MOVcc, RDFPRS, STD{A}, loads and stores from an internal ASI (4x-6x, 76, 77), SAVE, RESTORE, RETURN, DONE, RETRY, WRPR, and MEMBAR #Sync instructions cannot be dispatched until three clocks after older loads have returned data. The instruction is stalled in the G-stage until the N3 Stage of the earliest outstanding load, if the load is not enqueued. For example:
PIPELINE EXAMPLE 22-30 Some instructions must wait three clocks from data return of prior loads
load (not enqueued)   G  E  C  N1 N2 N3 W
SAVE                                 G  E  C  N1
LD{SB,SH,SW,UB,UH,UW,X}{A}, LD{D}F{A}, LDD{A}, LDSTUB{A}, SWAP{A}, CAS{X}A, LD{X}FSR, MEMBAR #MemIssue and MEMBAR #StoreLoad are held in the G-stage if there are already nine outstanding loads. A load is considered outstanding from the clock that it enters the E Stage through the clock that it returns data.
22.7.2
Store Dependencies
A store is considered outstanding from the clock that it enters the E Stage until two clocks after the data leaves the store buffer. Data leaves the store buffer when the write is issued to the E-cache SRAM for cacheable accesses, to PCI or UPA64S for noncacheable accesses, and to the internal register for internal ASIs. If there is no extra delay, a noncacheable store or cacheable store that misses the D-cache is outstanding for ten clocks after it is dispatched. An internal ASI or cacheable store that hits the D-cache is outstanding for eleven clocks after it is dispatched. If the last two stores in the store buffer are writing to the same 8-byte block and both are ready to go to the E-cache, the store buffer compresses the two entries into one. This reduces the number of outstanding stores by one. Compression is repeated as long as the last two entries are ready to go and are compressible. There is additional compression of sequential 8-byte stores to UPA64S.
ST{B,H,W,X}{A}, STF{A}, STDF{A}, STD{A}, LDSTUB{A}, SWAP{A}, CAS{X}A, FLUSH, STBAR, MEMBAR #StoreStore, and MEMBAR #LoadStore are not dispatched if there are already eight outstanding stores. A block store counts as eight outstanding stores when it is dispatched. If bits 13..4 of a store’s effective memory address are the same as an older load in the load buffer, the store remains outstanding until four clocks after the load is not outstanding. See “Event Ordering on UltraSPARC-IIi” on page 453 for other details of event ordering.
LDSTUB, SWAP, CAS{X}A, store to internal ASI, block store, FLUSH, and MEMBAR #Sync instructions are not dispatched until no older stores are outstanding. The maximum rate of internal ASI stores or atomics is one every 12 clocks.
ST{X}FSR cannot be dispatched in the two groups following another ST{X}FSR.
PDIST cannot be dispatched in the group after a floating-point store or when a block store is outstanding.
22.8
Floating-Point and Graphic Instructions
Floating-point and graphics instructions that reference floating-point registers are divided into two classes: A and M. Two of these instructions can be dispatched together only if they are in different classes.
A Class:
F{i,x}TO{s,d}, F{s,d}TO{d,s}, F{s,d}TO{i,x}, FABS{s,d}, FADD{s,d}, FALIGNDATA, FAND{s}, FANDNOT1{s}, FANDNOT2{s}, FCMP{E}{s,d}, FEXPAND, FMOVr{s,d}, FMOV{s,d}cc, FNAND{s}, FNEG{s,d}, FNOR{s}, FNOT1{s}, FNOT2{s}, FONE{s}, FOR{s}, FORNOT1{s}, FORNOT2{s}, FPADD{16,32}{s}, FPMERGE, FPSUB{16,32}{s}, FSRC1{s}, FSRC2{s}, FSUB{s,d}, FXNOR{s}, FXOR{s}, and FZERO{s}.
M Class:
FCMP{LE,NE,GT,EQ}{16,32}, FDIST, FDIV{s,d}, FMUL{d}8SUx16, FMUL{d}8ULx16, FMUL{s,d}, FMUL8x16{AL,AU}, FPACK{16,32,FIX}, FsMULd, and FSQRT{s,d}. FDIV{s,d}, FSQRT{s,d}, and FCMP{LE,NE,GT,EQ}{16,32} instructions break the group; that is, no earlier instructions are dispatched with these instructions.
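The class rule above can be sketched as a simple lookup. The sets below are small representative subsets of the A-Class and M-Class lists, not the full lists, and the function names are hypothetical. The check covers the class rule only; other rules in this chapter (for example, FDIV and FSQRT breaking the group) still apply.

```python
# Representative subsets of the two class lists above (not exhaustive).
A_CLASS = {"FADDs", "FADDd", "FSUBs", "FSUBd", "FABSs", "FNEGd",
           "FPADD16", "FALIGNDATA"}
M_CLASS = {"FMULs", "FMULd", "FDIVs", "FDIVd", "FSQRTs", "FSQRTd",
           "FsMULd", "FPACK16"}

def fp_class(mnemonic):
    """Return 'A' or 'M' for a mnemonic in the sample sets."""
    if mnemonic in A_CLASS:
        return "A"
    if mnemonic in M_CLASS:
        return "M"
    raise ValueError(f"{mnemonic} is not in the sample sets")

def classes_allow_grouping(first, second):
    """Class rule only: two such instructions can be dispatched together
    only if they belong to different classes."""
    return fp_class(first) != fp_class(second)

print(classes_allow_grouping("FADDd", "FMULd"))
print(classes_allow_grouping("FADDd", "FSUBd"))
```

Interleaving A-Class and M-Class operations in the instruction stream therefore gives the grouping logic the best chance of dispatching two floating-point or graphics instructions per group.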
22.8.1
Floating-Point and Graphics Instruction Dependencies
Instructions that have the same destination register (in the same register file) cannot be grouped together. For example:
PIPELINE EXAMPLE 22-31 Instructions with the same destination register cannot be grouped
FADD f2, f2, f6       G  E  C  N1 N2 N3 W
LDF  [r0+r1], f6         G  E  C  N1 N2 N3 W
FBfcc cannot be grouped with an older FCMP{E}{s,d}, even if they reference different floating-point condition codes. For example:
PIPELINE EXAMPLE 22-32 These two instructions cannot be grouped
FCMP  fcc0, f2, f4    G  E  C  N1 N2 N3 W
FBfcc fcc1, target       G  E  C  N1 N2 N3 W
It is possible, however, for an FCMP{E}{s,d} to be grouped with an older FBfcc in the same group. For example:
PIPELINE EXAMPLE 22-33 FCMP{E}{s,d} can be grouped with an older FBfcc
FBfcc   G  E  C  N1 N2 N3 W
FCMP    G  E  C  N1 N2 N3 W
An FMOVcc that references the same condition code set by a FCMP{E}{s,d} cannot be in the same or the following group. For example:
PIPELINE EXAMPLE 22-34 Grouping for FMOVcc that references the same condition code set by a FCMP{E}{s,d}
FCMP   fcc0, f2, f4   G  E  C  N1 N2 N3 W
FMOVcc fcc0, f6, f8         G  E  C  N1 N2 N3 W
FMOVcc cannot be in the same group as FCMP{E}{s,d}, because they are both A-Class floating-point instructions. MOVcc based on a floating-point condition code can be in the same group as an FCMP{E}{s,d}, however, if they reference different condition codes. For example:
PIPELINE EXAMPLE 22-35 MOVcc can be grouped with an FCMP{E}{s,d} if FP condition codes are different
FCMP  fcc0, f2, f4    G  E  C  N1 N2 N3 W
MOVcc fcc1, f6, f8    G  E  C  N1 N2 N3 W
Latencies between dependent floating-point and graphics instructions are shown in TABLE 22-2 on page 380. Latencies depend on the instruction generating the result (use the left column of the table to select a row) and the operation using the result (use the top row of the table to select a column). For example, PIPELINE EXAMPLE 22-36:
PIPELINE EXAMPLE 22-36 Groupings also depend upon latency of the instruction producing a result for a subsequent operation
FADDs f2, f3, f0   G  E  C  N1 N2 N3 W
FMULs f6, f1, f2            G  E  C  N1 N2 N3

FADDs f2, f3, f0   G  E  C  N1 N2 N3 W
FMOVs f6, f1, f2               G  E  C  N1 N2
FDIV{s,d}, FSQRT{s,d}, block load, block store, ST{X}FSR, and LD{X}FSR instructions wait in the G-stage for the remaining latency of the previous divide or square root, even if there is no data dependency. An FGA or FGM instruction (see TABLE 22-2) that first enters the G-stage one cycle before an FDIV or FSQRT dependent instruction would be released will be held for one clock, regardless of data dependency.
FDIV and FSQRT use the floating-point multiplier for final rounding, so an M-Class operation cannot be dispatched in the third clock before the divide is finished. A load use stall that occurs in the third or fourth clock before normal divide completion will delay completion by a corresponding amount.
FDIV and FSQRT stall earlier instructions with the same rd (including floating-point loads) for the same time as a source register dependency. Graphics instructions, FdTOi, FxTOs, FdTOs, FDIVs, and FSQRTs lock the double-precision register containing the single-precision result for data dependency checking. For example:
PIPELINE EXAMPLE 22-37 Group separation because of dependency checking of prior result
FORs  f2, f4, f0   G  E  C  N1 N2 N3 W
FANDs f1, f1, f1      G  E  C  N1 N2 N3 W
Floating-point stores other than ST{X}FSR can store the result of a floating-point or graphics instruction other than FDIV or FSQRT and be in the same group. For example:
PIPELINE EXAMPLE 22-38 Most FP stores can be in the same group
FADDs f2, f5, f6      G  E  C  N1 N2 N3 W
STF   f6, [address]   G  E  C  N1 N2 N3 W
Floating-point stores of the result of an FDIV or FSQRT are treated the same as a dependent floating-point instruction.
ST{X}FSR cannot be dispatched in the two groups following a floating-point or graphics instruction that references the floating-point registers. For example:
PIPELINE EXAMPLE 22-39 ST{X}FSR cannot be in two groups following a reference to the FP registers
FMULd   G  E  C  N1 N2 N3 W
STFSR            G  E  C  N1 N2 N3
To simplify critical timing paths, floating-point operations are usually stalled in the G-stage until earlier floating-point operations with a different precision complete, regardless of data dependency. This behavior is described more precisely in the following two rules. Floating-point loads and stores are independent of these mixed precision rules.
• A floating-point or graphics instruction that follows an FMOV, FABS, or FNEG of different precision breaks the group, even if there is no data dependency. For example:
PIPELINE EXAMPLE 22-40 Group separation for instructions following FMOV, FABS, FNEG, of differing precision
FMOVs   G  E  C  N1 N2 N3 W
FMULd      G  E  C  N1 N2 N3 W
• A floating-point or graphics instruction following an operation other than FMOV, FABS, FNEG, FDIV, FSQRT of different precision is stalled until the N2 Stage of the earlier operation, even if there is no data dependency. For example:
PIPELINE EXAMPLE 22-41 Stall for instructions following other instructions of differing precision
FADDs f2, f5, f0   G  E  C  N1 N2 N3 W
FMULd f2, f2, f2               G  E  C  N1 N2
As an exception to the previous rule, FDIV or FSQRT can be grouped with an older operation of different precision, but are stalled until the N2 Stage of the earlier operation otherwise. For the preceding two rules, all graphics instructions, FDIVs, FSQRTs, FdTOi, FsTOx, FiTOd, FxTOs, FsTOd, FdTOs, and FsMULd are considered to be double, even though a single-precision register is referenced. For example, the following instructions can be grouped together:
PIPELINE EXAMPLE 22-42 Instructions grouped because graphics instruction is considered as double
FORs  f2, f4, f0   G  E  C  N1 N2 N3 W
FANDs f2, f2, f2   G  E  C  N1 N2 N3 W
22.8.2
Floating-Point and Graphics Instruction Latencies
TABLE 22-2 on page 380 documents the latencies for floating-point and graphics instructions. For table entries containing two numbers, premature dispatching occurs when the destination and source precision are different, but both are treated as double because of a graphics or mixed-precision floating-point instruction. To avoid the pipe flush overhead, software should explicitly force the use instruction to be at least the latency number of groups after the source instruction. Mixed precision bypassing is unlikely to occur with floating-point data. Software scheduling is only needed for initializing the PDIST rd register and for graphics instructions whose single-precision results are used as part of a double-precision graphics source operand, or vice versa.
The table uses the following abbreviations:
TABLE 22-1 Abbreviations Used in TABLE 22-2
Abbrev.   Meaning
FGA       Graphics A-Class instruction
FGM       Graphics M-Class instruction
FPA       Floating-point A-Class instruction
FPM       Floating-point M-Class instruction
TABLE 22-2
Latencies for Floating-Point and Graphics Instructions →
FPA or FPM FGA FGM
Result used by Result generated by:
↓
FADD{s,d} FSUB{s,d} F{s,d}TO{i,x} F{i,x}TO{d,s} F{s,d}TO{d,s} FCMP{s,d} FCMPE{s,d} FMUL{s,d} FsMULd FDIV{s,d} FSQRT{s,d} FADD{s,d} FSUB{s,d} F{s,d}TO{i,x} F{i,x}TO{d,s} F{s,d}TO{d,s} FMUL{s,d} FsMULd FDIVs, FSQRTs FDIVd, FSQRTd FMOV{s,d} FABS{s,d} FNEG{s,d} FMOVr{s,d} FMOVcc{s,d} FPADD{16,32}{s} FPSUB{16,32}{s} FALIGNDATA FPMERGE FEXPAND FPACK{16,32,FIX}
FMOVr{s,d} FMOVcc{s,d} FMOV{s,d} FABS{s,d} FNEG{s,d} FPADD{16,32}{s} FPSUB{16,32}{s} FALIGNDATA FPMERGE FEXPAND
FPACK{16,32,FIX} FMUL8x16{AL,A U} FMUL{d}8ULx16 FMUL{d}8SUx16 PDIST{rs1, rs2} FCMPLE{16,32} FCMPNE{16,32} FCMPGT{16,32} FCMPEQ{16,32}
PDIST {rd}
3[4]1
4
4
[2]1
FPA or FPM
12[13]1 22[23]1 1
13 23 1
13 23 1
13 23 [2]1 [2]1
2
2
2
FGA
2
1
1[2]1
[2]1
4
3
1[4]1
[2]1
FGM
FMUL8x16{AL,A U} FMUL{d}8ULx16 FMUL{d}8SUx16 PDIST
4
3
3[4]1
1
1. Latency numbers enclosed in square brackets ([ ]) indicate cases where the hardware may prematurely dispatch a dependent instruction from the G-stage, cancel it in the W Stage, and then refetch it. This effectively inserts nine bubbles into the pipe.
APPENDIX
A
Debug and Diagnostics Support
A.1
Overview
All debug and diagnostics accesses are double-word aligned, 64-bit accesses. Nonaligned accesses cause a mem_address_not_aligned trap. Accesses must use LDXA/STXA/LDFA/STDFA instructions, except for the instruction cache ASIs, which must use LDDA/STDA/STDFA. Using another type of load or store causes a data_access_exception trap (with SFSR.FT = 8, Illegal ASI size). An attempt to access these registers in non-privileged mode causes a data_access_exception trap (with SFSR.FT = 1, privilege violation). User accesses can be made through system calls to these facilities. See Section 15.9.4, “I-/D-MMU Synchronous Fault Status Registers (SFSR)” on page 223 for SFSR details.
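The three access conditions above can be summarized as a checklist. The sketch below is illustrative only: the function name and the returned strings are hypothetical, the mnemonic set is copied from the general (non I-cache) list in the text, and the relative priority of the traps when several conditions are violated at once is not stated here and is not modeled.

```python
# Mnemonics listed in the text for general diagnostic accesses.
# Instruction cache ASIs use a different set and are not modeled here.
GENERAL_DIAG_OPS = {"LDXA", "STXA", "LDFA", "STDFA"}

def diagnostic_access_faults(addr, mnemonic, privileged):
    """Return every violated condition for one diagnostic access.

    The list order is arbitrary; it does not imply trap priority.
    """
    faults = []
    if addr % 8 != 0:                      # not double-word aligned
        faults.append("mem_address_not_aligned")
    if mnemonic not in GENERAL_DIAG_OPS:   # wrong load/store type
        faults.append("data_access_exception (SFSR.FT=8)")
    if not privileged:                     # non-privileged access
        faults.append("data_access_exception (SFSR.FT=1)")
    return faults

print(diagnostic_access_faults(0x38, "LDXA", True))
print(diagnostic_access_faults(0x3C, "LDXA", True))
```

An empty list means the access satisfies all three conditions described in this section.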
Caution – A STXA to any internal debug or diagnostic register requires a MEMBAR #Sync before another load instruction is executed. The MEMBAR #Sync must also be done on or before the delay slot of a delayed control transfer instruction of any type. This condition is not only to guarantee that the result of the STXA is seen; the STXA may corrupt the load data if there is not an intervening MEMBAR #Sync.
A.2
Diagnostics Control and Accesses
The UltraSPARC-IIi diagnostics control and data registers are accessed through RDASR/WRASR or through load/store alternate instructions.
A.3
Dispatch Control Register
ASR 0x18: The Dispatch Control Register enables performance features related to instruction dispatch, and also controls the output of internal signals to the UltraSPARC-IIi SYSADR[14:0] pins to help in chip debug and instrumentation. For a more detailed description, see Section I.1.2, “Dispatch Control Register” on page 458.
A.4
Floating-Point Control
Two state bits (PSTATE.PEF and FPRS.FEF) in the SPARC-V9 architecture provide the means to disable direct floating-point execution. If either field is cleared, an fp_disabled trap is taken when a floating-point instruction is encountered.
Note – Graphics instructions that use the floating-point register file and instructions that read or update the Graphics Status Register (GSR) are treated as floating-point instructions. They cause an fp_disabled trap if either PSTATE.PEF or FPRS.FEF is cleared. See Section 13.4, “Graphics Instructions” on page 138 for more information.
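The enable condition described above is a logical AND of the two state bits. A minimal sketch, with a hypothetical function name and with the instruction classification reduced to a single boolean supplied by the caller:

```python
def takes_fp_disabled_trap(pef, fef, is_fp_or_graphics):
    """True if the instruction would cause an fp_disabled trap.

    pef and fef stand for PSTATE.PEF and FPRS.FEF. is_fp_or_graphics is True
    for floating-point instructions, graphics instructions that use the
    floating-point register file, and instructions that access the GSR.
    """
    return is_fp_or_graphics and not (pef and fef)

print(takes_fp_disabled_trap(pef=True, fef=False, is_fp_or_graphics=True))
```

Clearing either bit is sufficient to force the trap, which is why both must be set before floating-point or graphics code runs.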
A.5
Watchpoint Support
UltraSPARC-IIi implements “break before” watchpoint traps; instruction execution is stopped immediately before the watchpoint memory location is accessed. TABLE A-1 on page 383 lists ASIs that are affected by the two watchpoint traps. For 128-bit atomic load and 64-byte block load and store, a watchpoint trap is generated only if the watchpoint overlaps the lowest addressed 8 bytes of the access.
Note – In order to avoid trapping indefinitely, software should emulate the instruction at the watched address and execute a DONE instruction or turn off the watchpoint before exiting a watchpoint trap handler.
TABLE A-1 ASIs Affected by Watchpoint Traps
ASI Type | ASI Range | D-MMU | Watchpoint if Matching VA | Watchpoint if Matching PA
Translating ASIs | 0x04..0x11, 0x18..0x19, 0x24..0x2C, 0x70..0x71, 0x78..0x79, 0x80..0xFF | On | Y | Y
Translating ASIs | (same range) | Off | N | Y
Bypass ASIs | 0x14..0x15, 0x1C..0x1D | - | N | Y
Nontranslating ASIs | 0x45..0x6F, 0x76..0x77, 0x7E..0x7F | - | N | N
A.5.1
Instruction Breakpoint
There is no hardware support for instruction breakpoint in UltraSPARC-IIi. The TA (Trap Always) instruction can be used to set program breakpoints.
A.5.2
Data Watchpoint
Two 64-bit data watchpoint registers provide the means to monitor data accesses during program execution. When virtual/physical data watchpoint is enabled, the virtual/physical addresses of all data references are compared against the content of the corresponding watchpoint register. If a match occurs, a VA_/PA_watchpoint trap is signalled before the data reference instruction is completed. The virtual address watchpoint trap has higher priority than the physical address watchpoint trap. Separate 8-bit byte masks allow watchpoints to be set for a range of addresses. Zero bits in the byte mask cause the comparison to ignore the corresponding bytes in the address. These watchpoint byte masks and the watchpoint enable bits reside in the LSU_Control_Register. See Section A.6, “LSU_Control_Register” on page 384 for a complete description.
A.5.3
Virtual Address (VA) Data Watchpoint Register
Bits 63..3: DB_VA (tick mark between bits 44 and 43)   Bits 2..0: reserved
FIGURE A-1 VA Data Watchpoint Register Format (ASI 0x58, VA=0x38)
DB_VA: The 64-bit virtual data watchpoint address
Note – UltraSPARC-I and UltraSPARC-II support a 44-bit virtual address space. Software must write a sign-extended 64-bit address into the VA watchpoint register. The watchpoint address is sign-extended to 64 bits from bit 43 when read.
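Sign extension from bit 43 can be sketched as follows. The function name and the sample address are hypothetical; Python integers are unbounded, so the result is masked to 64 bits explicitly.

```python
MASK64 = (1 << 64) - 1
LOW44 = (1 << 44) - 1

def sign_extend_va44(va):
    """Copy bit 43 of va into bits 63..44 and return a 64-bit value."""
    low = va & LOW44
    if low & (1 << 43):
        low |= MASK64 & ~LOW44   # set bits 63..44
    return low

print(hex(sign_extend_va44(0x0000_0800_0000_0038)))
```

A value prepared this way has bits 63..44 equal to bit 43, which is the form the note above asks software to write into the VA watchpoint register.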
A.5.4
Physical Address Data Watchpoint Register
Bits 63..3: DB_PA (tick mark between bits 41 and 40)   Bits 2..0: reserved
FIGURE A-2 PA Data Watchpoint Register Format (ASI 0x58, VA=0x40)
DB_PA: The 41-bit physical data watchpoint address
Note – UltraSPARC-I and UltraSPARC-II support a 41-bit physical address space. Software must write a zero-extended 64-bit address into the watchpoint register.
A.6
LSU_Control_Register
ASI 0x45, VA=0x00
Name: ASI_LSU_CONTROL_REGISTER
The LSU_Control_Register contains fields that control several memory-related hardware functions in UltraSPARC-IIi. These include I-cache, D-cache, MMUs, bad parity generation, and watchpoint setting. See also TABLE 17-3 on page 272 for the state of this register after reset or RED_state trap.
Bits     Field
63..44   reserved
43       reserved
42       reserved
41       reserved
40..33   PM
32..25   VM
24       PR
23       PW
22       VR
21       VW
20       reserved
19..4    FM
3        DM
2        IM
1        DC
0        IC
FIGURE A-3 LSU_Control_Register Access Data Format (ASI 0x45)
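As a reading aid, the sketch below splits a 64-bit LSU_Control_Register value into its named fields. The bit positions are an assumption read off FIGURE A-3 (IC = bit 0, DC = 1, IM = 2, DM = 3, FM = 19..4, VW = 21, VR = 22, PW = 23, PR = 24, VM = 32..25, PM = 40..33), and the names `LSU_FIELDS` and `decode_lsu` are hypothetical. How the value is read from ASI 0x45 is outside the scope of the sketch.

```python
# (least significant bit, width) per field; positions assumed from FIGURE A-3.
LSU_FIELDS = {
    "IC": (0, 1), "DC": (1, 1), "IM": (2, 1), "DM": (3, 1),
    "FM": (4, 16),
    "VW": (21, 1), "VR": (22, 1), "PW": (23, 1), "PR": (24, 1),
    "VM": (25, 8), "PM": (33, 8),
}

def decode_lsu(value):
    """Return a dict of field name to field value for a 64-bit register value."""
    return {
        name: (value >> lsb) & ((1 << width) - 1)
        for name, (lsb, width) in LSU_FIELDS.items()
    }

# Both caches and both MMUs enabled, everything else clear.
fields = decode_lsu(0xF)
print(fields["IC"], fields["DC"], fields["IM"], fields["DM"], fields["FM"])
```

Keeping the layout in one table makes it easy to correct a position if the figure is read differently, without touching the decoding logic.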
A.6.1
Cache Control
IC: LSU.I-cache_enable; if cleared, misses are forced on I-cache accesses with no cache fill.
DC: LSU.D-cache_enable; if cleared, misses are forced on D-cache accesses with no cache fill. A FLUSH, DONE, or RETRY instruction is needed after software changes this bit to ensure the new information is used.
A.6.2
MMU Control
IM: LSU.enable_I-MMU; if cleared, the I-MMU is disabled (pass-through mode).
DM: LSU.enable_D-MMU; if cleared, the D-MMU is disabled (pass-through mode).
Note – When the MMU/TLB is disabled, a VA is passed through to a PA. Accesses are assumed to be non-cacheable with side-effects.
A.6.3
Parity Control
FM: LSU.parity_mask; if set, UltraSPARC-IIi writes generate incorrect parity on the E-cache data bus for bytes corresponding to this mask. The parity_mask corresponds to the 16 bytes of the E-cache data bus.
Note – The parity mask is endian-neutral.
TABLE A-2 LSU Control Register: Parity Mask Examples
Parity Mask   Addr of Bytes Affected
              FEDC   BA98   7654   3210
0x0000        0000   0000   0000   0000
0x0001        0000   0000   0000   0001
0x2222        0010   0010   0010   0010
0xFFFF        1111   1111   1111   1111
A.6.4
Watchpoint Control
Watchpoint control is further discussed in Section A.5, “Watchpoint Support” on page 382.
A.6.4.1
Virtual Address Data Watchpoint Enable
VR, VW: LSU.virtual_address_data_watchpoint_enable; if VR/VW is set, a data read/write that matches the (range of) addresses in the virtual watchpoint register causes a watchpoint trap. Both VR and VW may be set to place a watchpoint for either a read or write access.
A.6.4.2
Virtual Address Data Watchpoint Byte Mask
VM: LSU.virtual_address_data_watchpoint_mask; the virtual_address_data_watch_point_register contains the virtual address of a 64-bit word to be watched. The 8-bit virtual_address_data_watch_point_mask controls which bytes within the 64-bit word should be watched. If all eight bits are cleared, the virtual watchpoint is disabled. If the watchpoint is enabled and a data reference overlaps any of the watched bytes in the watchpoint mask, a virtual watchpoint trap is generated.
TABLE A-3 LSU Control Register: VA/PA Data Watchpoint Byte Mask Examples
Watchpoint Mask   Addr of Bytes Watched
                  7654   3210
0x00              Watchpoint disabled
0x01              0000   0001
0x32              0011   0010
0xFF              1111   1111
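The byte-mask comparison can be sketched as an overlap test between the bytes touched by an access and the watched bytes of the 64-bit word. The sketch assumes, following the column headings of TABLE A-3, that mask bit n selects the byte at offset n within the watched word. It ignores the read/write enable bits and the special case for 128-bit atomic and 64-byte block accesses. The function name and sample addresses are hypothetical.

```python
def watchpoint_hit(wp_addr, byte_mask, access_addr, access_size):
    """True if the access overlaps any watched byte of the 64-bit word.

    wp_addr:     address held in the watchpoint register
    byte_mask:   8-bit VM or PM value; 0 disables the watchpoint
    access_addr: first byte address of the data reference
    access_size: number of bytes referenced
    """
    word_base = wp_addr & ~0x7              # start of the watched 64-bit word
    for offset in range(8):
        if byte_mask & (1 << offset):       # byte at this offset is watched
            byte = word_base + offset
            if access_addr <= byte < access_addr + access_size:
                return True
    return False

# Mask 0x32 watches the bytes at offsets 1, 4 and 5.
print(watchpoint_hit(0x1000, 0x32, 0x1004, 4))
print(watchpoint_hit(0x1000, 0x32, 0x1002, 2))
```

The first call touches offsets 4..7 and overlaps watched bytes; the second touches offsets 2 and 3 only, which the mask leaves unwatched.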
A.6.4.3
Physical Address Data Watchpoint Enable
PR, PW: LSU.physical_address_data_watchpoint_enable; if PR/PW is set, a data read/write that matches the (range of) addresses in the physical watchpoint register causes a watchpoint trap. Both PR and PW may be set to place a watchpoint on either a read or write access.
A.6.4.4
Physical Address Data Watchpoint Byte Mask
PM: LSU.physical_address_data_watchpoint_mask; the physical_address_data_watch_point_register contains the physical address of a 64-bit word to be watched. The 8-bit physical_address_data_watch_point_mask controls which bytes within the 64-bit word should be watched. If all eight bits are cleared, the physical watchpoint is disabled. If the watchpoint is enabled and a data reference overlaps any of the watched bytes in the watchpoint mask, a physical watchpoint trap is generated.
A.7
I-cache Diagnostic Accesses
The instruction cache (I-cache) utilizes the Dynamic Set Prediction technique to realize a set-associative cache with a direct-mapped physical RAM design. The direct-mapped RAM core is logically divided into two sets. Rather than using the tag to determine which set contains the requested instructions, a set prediction from the last access to the I-cache is used to access the instructions for the current fetch.
FIGURE A-4 Simplified I-cache Organization (Only 1 Set Shown)
Fields per cache line: LRU (1b), sp (2×1b), next (2×11b), BRPD (4×2b), pre-decode (8×4b), instruction (8×32b), tag (28b), valid (1b).
Each set of the I-cache is divided into four fields per entry:
• The instruction field contains eight 32-bit instructions.
• The tag field contains a 28-bit physical tag and a valid bit.
• The pre-decode field contains eight 4-bit information packets about the instructions stored.
• The next field contains the LRU bit, next address, branch and set predictions. There is one physical LRU bit per I-cache line (that is, 16 instructions) but it is logically replicated for each set. There are four 2-bit dynamic branch prediction (BRPD) fields, one for each two adjacent instructions. There are two sets of set prediction and next address fields, one for each four instructions.
Note – To simplify the implementation, read access to the instruction cache fields (ASIs 0x60..0x6F) must use the LDDA instruction instead of LDXA or LDDFA. Using another type of load causes a data_access_exception trap (with SFSR.FT = 8, Illegal ASI size). LDDA updates two registers. The useful data is in the odd register; the contents of the even register are undefined.
A.7.1
I-cache Instruction Fields
ASI 0x66, VA[63:14]=0, VA[13]=IC_set, VA[12:3]=IC_addr, VA[2:0]=0
Name: ASI_ICACHE_INSTR

Bits    Field
63:14   -
13      IC_set
12:3    IC_addr
2:0     -
FIGURE A-5 I-cache Instruction Access Address Format (ASI 0x66)

IC_set: This 1-bit field selects a set (2-way associative).
IC_addr: This 10-bit index selects an aligned pair of 32-bit instructions.

Bits    Field
63:32   IC_instr 0
31:0    IC_instr 1
FIGURE A-6 I-cache Instruction Access Data Format (ASI 0x66)

IC_instr: two 32-bit instruction fields
A.7.2
I-cache Tag/Valid Fields
ASI 0x67, VA[63:14]=0, VA[13]=IC_set, VA[12:5]=IC_addr, VA[4:0]=0
Name: ASI_ICACHE_TAG

Bits    Field
63:14   -
13      IC_set
12:5    IC_addr
4:0     -
FIGURE A-7 I-cache Tag/Valid Access Address Format (ASI 0x67)

IC_set: This 1-bit field selects a set (2-way associative).
IC_addr: This 8-bit index (VA) selects a cache tag.

Bits    Field
63:37   Undefined
36      IC_valid
35:8    IC_tag
7:0     Undefined
FIGURE A-8 I-cache Tag/Valid Field Data Format (ASI 0x67)

Undefined: The values of these bits are undefined on reads and must be masked off by software.
IC_valid: The 1-bit valid field
IC_tag: The 28-bit physical tag field (PA of the associated instructions)
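As an illustration of the masking requirement, the sketch below builds a diagnostic address for ASI_ICACHE_TAG and extracts the defined fields from the returned data. It is a host-side model in Python, not processor code; the function names are hypothetical, and the bit positions are those read from FIGURE A-7 and FIGURE A-8 (IC_set in bit 13, IC_addr in bits 12:5, IC_valid in bit 36, IC_tag in bits 35:8).

```python
IC_TAG_BITS = 28


def icache_tag_va(ic_set: int, ic_addr: int) -> int:
    """Build the VA for an ASI_ICACHE_TAG access (set in bit 13, index in bits 12:5)."""
    if ic_set not in (0, 1) or not 0 <= ic_addr < 256:
        raise ValueError("IC_set is 1 bit and IC_addr is 8 bits")
    return (ic_set << 13) | (ic_addr << 5)


def decode_icache_tag(data: int):
    """Mask off the undefined bits and return (valid, tag)."""
    valid = bool((data >> 36) & 0x1)
    tag = (data >> 8) & ((1 << IC_TAG_BITS) - 1)
    return valid, tag
```

Because the undefined bits may read as anything, the decoder shifts and masks each field rather than comparing the raw 64-bit value.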
A.7.3
I-cache Predecode Field
ASI 0x6E, VA[63:14]=0, VA[13]=IC_set, VA[12:5]=IC_addr, VA[4:3]=IC_line, VA[2:0]=0
Name: ASI_ICACHE_PRE_DECODE

Bits    Field
63:14   -
13      IC_set
12:5    IC_addr
4:3     IC_line
2:0     -
FIGURE A-9 I-cache Predecode Field Access Address Format (ASI 0x6E)

IC_set: This 1-bit field selects a set (2-way associative).
IC_addr: This 8-bit index (i.e., addr[12:5]) selects an IC_line.
IC_line: For LDDA accesses, this 2-bit field selects a pair of pre-decode fields in a 64-bit-aligned instruction pair. For STXA accesses, the least significant bit is ignored. The most significant bit selects four pre-decode fields in a 128-bit-aligned instruction quad.

Bits    Field
63:8    Undefined
7:4     IC_pdec 0
3:0     IC_pdec 1
FIGURE A-10 I-cache Predecode Field LDDA Access Data Format (ASI 0x6E)

Bits    Field
63:16   Undefined
15:12   IC_pdec 0
11:8    IC_pdec 1
7:4     IC_pdec 2
3:0     IC_pdec 3
FIGURE A-11 I-cache Predecode Field STXA Access Data Format (ASI 0x6E)

Undefined: The values of these bits are undefined on reads and must be masked off by software.
IC_pdec: The two 4-bit pre-decode fields. The encodings are:
• Bits = 00 CALL, BPA, FBA, FBPA or BA
• Bits = 01 Not a CALL, JMPL, BPA, FBA, FBPA or BA
• Bits = 10 Normal JMPL (do not use return stack)
• Bits = 11 Return JMPL (use return stack)
• Bit: If clear, indicates a PC-relative CTI.
• Bit: If set, indicates a STORE.
Note – The predecode bits are not updated when instructions are loaded into the
cache with ASI_ICACHE_INSTR. They are only accurate for instructions loaded by instruction cache miss processing.
A.7.4
I-cache LRU/BRPD/SP/NFA Fields
ASI 0x6F, VA[63:14]=0, VA[13]=IC_set, VA[12:5]=IC_addr, VA[4]=IC_line, VA[3:0]=0
Name: ASI_ICACHE_PRE_NEXT_FIELD

Bits    Field
63:14   -
13      IC_set
12:5    IC_addr
4       IC_line
3:0     -
FIGURE A-12 I-cache LRU/BRPD/SP/NFA Field Access Address Format (ASI 0x6F)

Stores to ASI_ICACHE_PRE_NEXT_FIELD are undefined unless the instruction cache is disabled via the IC bit of the LSU control register (see “LSU_Control_Register” on page 384).
IC_set: This 1-bit field selects a set (2-way associative).
IC_addr: This 8-bit index (addr[12:5]) selects an IC_line.
IC_line: This 1-bit field selects two BRPD and one NFA fields for four 128-bit aligned instructions.

Bits    Field
63:25   Undefined
24      IC_lru
23      IC_sp
22:12   IC_nfa
11:10   IC_brpd 0
9:8     IC_brpd 1
7:0     und.
FIGURE A-13 I-cache LRU/BRPD/SP/NFA Field LDDA Access Data Format (ASI 0x6F)

Undefined, und.: The values of these bits are undefined on reads and must be masked by software.
IC_lru: selects the least recently accessed set of the line corresponding to IC_addr. There is only one physical LRU bit per IC_addr value (i.e., cache line). The IC_lru field can be read for each value of IC_set and IC_line, but can only be written when IC_set is zero.
Note – The LRU bit is not updated when instructions are accessed with ASI_ICACHE_INSTR.
IC_brpd: Two 2-bit dynamic branch prediction fields. The encodings are:
• IC_brpd bit: If set, strong prediction
• IC_brpd bit: If set, taken prediction
During I-cache miss processing, IC_brpd is initialized to likely-taken if either of the corresponding instructions is a branch with static prediction bit set; otherwise, IC_brpd is set to likely-not-taken. The prediction bits are subsequently updated according to the dynamic branch history of the corresponding instructions, as shown in FIGURE A-14. (Note: This figure is identical to FIGURE 21-6.)
FIGURE A-14 Dynamic Branch Prediction State Diagram
States: ST (Strongly Taken), LT (Likely Taken), LNT (Likely Not Taken), SNT (Strongly Not Taken).
Transition labels: PT (Predicted Taken), PNT (Predicted Not Taken), AT (Actual Taken), ANT (Actual Not Taken).
IC_sp: 1-bit Set-Prediction (SP) field; selects the next set from which to fetch.
IC_nfa: 11-bit Next-Field-Address field (NFA = VA); selects the next line from which to fetch and the instruction offset within that line.
Note – The branch prediction, set prediction and next field address fields are not updated when instructions are loaded into the cache with ASI_ICACHE_INSTR.
When a cache line is brought into the I-cache, the corresponding IC_sp fields are initialized to the same set as the currently missed line. The corresponding IC_nfa fields are initialized to the next sequential sub-block.
A.8
D-cache Diagnostic Accesses
Two D-cache ASI accesses are supported: data (ASI 0x46) and tag/valid (ASI 0x47).
A.8.1
D-cache Data Field
ASI 0x46, VA[63:14]=0, VA[13:3]=DC_addr, VA[2:0]=0
Name: ASI_DCACHE_DATA

Bits    Field
63:14   -
13:3    DC_addr
2:0     -
FIGURE A-15 D-cache Data Access Address Format (ASI 0x46)

DC_addr: This 11-bit index selects a 64-bit data field (16KB).

Bits    Field
63:0    DC_data
FIGURE A-16 D-cache Data Access Data Format (ASI 0x46)

DC_data: 64-bit data
A.8.2
D-cache Tag/Valid Fields
ASI 0x47, VA[63:14]=0, VA[13:5]=DC_addr, VA[4:0]=0
Name: ASI_DCACHE_TAG

Bits    Field
63:14   -
13:5    DC_addr
4:0     -
FIGURE A-17 D-cache Tag/Valid Access Address Format (ASI 0x47)

DC_addr: This 9-bit index selects a tag/valid field (512 tags).

Bits    Field
63:30   -
29:2    DC_tag
1:0     DC_valid
FIGURE A-18 D-cache Tag/Valid Access Data Format (ASI 0x47)

DC_tag: The 28-bit physical tag (PA of the associated data).
DC_valid: The 2-bit valid field, one bit for each sub-block (32-byte block, 16-byte sub-block). One bit corresponds to the highest addressed 16 bytes, the other to the lowest addressed 16 bytes.
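A small Python model of the D-cache tag access may help when post-processing diagnostic dumps. The helper names are hypothetical; the field positions are taken from FIGURE A-17 and FIGURE A-18 (DC_addr in VA bits 13:5, DC_tag in data bits 29:2, DC_valid in data bits 1:0). The sketch returns the valid field as a raw 2-bit value, because it makes no assumption about which bit covers which 16-byte sub-block.

```python
DC_TAG_BITS = 28


def dcache_tag_va(dc_addr: int) -> int:
    """Build the VA for an ASI_DCACHE_TAG access (9-bit index in bits 13:5)."""
    if not 0 <= dc_addr < 512:
        raise ValueError("DC_addr is a 9-bit index (512 tags)")
    return dc_addr << 5


def decode_dcache_tag(data: int):
    """Return (tag, valid_bits) from the 64-bit value read with ASI_DCACHE_TAG."""
    tag = (data >> 2) & ((1 << DC_TAG_BITS) - 1)
    valid_bits = data & 0x3  # raw 2-bit sub-block valid field
    return tag, valid_bits
```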
A.9
E-cache Diagnostics Accesses
Compatibility Note – Because of the smaller external cache data and tag, some
adjustments are made to these diagnostic accesses. Separate ASIs are provided for reading (0x7E) and writing (0x76) the E-cache tags and data.
Note – During E-cache diagnostics accesses, the VA is passed through to the PA without page mapping. To avoid undesired modification of the E-cache state, take care when using LDXA/STXA instructions with these ASIs that no cacheable instruction prefetch has a PA matching the PA of the E-cache diagnostic access. It is permissible, however, for the E-cache state to change; there is no hardware conflict involved.
Caution – Using ASI 0x76/77/7E/7F with VA[40:39]==00 and a VA[15:0] matching
any of the PA[15:0] listed for the CSR addresses in noncacheable space, other than 0x00, 0x18, 0x20, 0x38, 0x40, 0x50, 0x60, or 0x70, can cause a load to return data, and a store to modify, the corresponding CSR. The list of addresses is in Section 19.4.3, “DMA Error Registers” on page 330. These ASIs are protected by privilege bit/trap so as not to provide an unprotected back-door access.
A.9.1
E-cache Data Fields
ASI 0x76 (WRITING) or 0x7E (READING), VA[63:41]==0, VA[40:39]==1, and:
• VA[38:18]==0, VA[17:3]==EC_addr, VA[2:0]==0 (0.25 MB)
• VA[38:19]==0, VA[18:3]==EC_addr, VA[2:0]==0 (0.5 MB)
• VA[38:20]==0, VA[19:3]==EC_addr, VA[2:0]==0 (1 MB)
• VA[38:21]==0, VA[20:3]==EC_addr, VA[2:0]==0 (2 MB)
Name: ASI_ECACHE_W (0x76), ASI_ECACHE_R (0x7E)

Bits    Field
63:41   -
40:39   01
38:21   -
20:3    EC_addr
2:0     -
FIGURE A-19 E-cache Data Access Address Format

EC_addr:
• A 15-bit index selects a 64-bit data field from a 0.25 MB E-cache.
• A 16-bit index selects a 64-bit data field from a 0.5 MB E-cache.
• A 17-bit index selects a 64-bit data field from a 1 MB E-cache.
• An 18-bit index selects a 64-bit data field from a 2 MB E-cache.

Bits    Field
63:0    EC_data
FIGURE A-20 E-cache Data Access Data Format

EC_data: 64-bit data
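The address formation for an E-cache data access can be sketched as follows. This Python model is an illustration under stated assumptions: the function name and the size table keyed in KB are hypothetical, the index is assumed to start at VA bit 3 as FIGURE A-19 shows, and the select value 01 is placed in VA[40:39].

```python
# E-cache size in KB -> width of EC_addr in bits, per the EC_addr description
ECACHE_INDEX_BITS = {256: 15, 512: 16, 1024: 17, 2048: 18}


def ecache_data_va(size_kb: int, ec_addr: int) -> int:
    """Build the VA for an E-cache data diagnostic access (ASI 0x76 or 0x7E)."""
    bits = ECACHE_INDEX_BITS[size_kb]
    if not 0 <= ec_addr < (1 << bits):
        raise ValueError(f"EC_addr must fit in {bits} bits for a {size_kb} KB E-cache")
    return (0b01 << 39) | (ec_addr << 3)
```

Rejecting an out-of-range index keeps the bits above EC_addr at zero, as the address format requires for each cache size.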
A.9.2
E-cache Tag/State/Parity Field Diagnostic Accesses
ASI 0x76 (WRITING) or 0x7E (READING), VA[63:41]==0, VA[40:39]==2, and:
• VA==0, VA==EC_addr, VA==0 (0.25 MB)
• VA==0, VA==EC_addr, VA==0 (0.5 MB)
• VA==0, VA==EC_addr, VA==0 (1 MB)
• VA==0, VA==EC_addr, VA==0 (2 MB)
Name: ASI_ECACHE_W (0x76), ASI_ECACHE_R (0x7E)

Bits    Field
63:41   -
40:39   10
38:22   -
21:6    EC_addr
5:0     -
FIGURE A-21 E-cache Tag Access Address Format

If read, the contents of the E-cache tag/state/parity fields in the selected E-cache line are stored in the E-cache_tag_data_register. This register can be read by an LDA with ASI_ECACHE_TAG_DATA; its contents are written to the destination register.
If written, the contents of the E-cache_tag_data_register are written to the selected E-cache tag/state/parity fields. The contents of the E-cache_tag_data_register must have been updated beforehand with an STA at ASI_ECACHE_TAG_DATA.
Note – Software must ensure that the two-step operations are done atomically; e.g., LDXA ASI_ECACHE (TAG) and LDXA ASI_ECACHE_TAG_DATA, STXA ASI_ECACHE_TAG_DATA and STXA ASI_ECACHE (TAG).
Note – The destination register of a LDXA ASI_ECACHE (TAG) is undefined. It is recommended to use %g0 as the destination for this ASI access. Similarly, the contents of the destination register in STXA ASI_ECACHE (TAG) are ignored, but the contents of the E-cache_tag_data_register are written to the selected E-cache line.
A.9.3
E-cache Tag/State/Parity Data Accesses
ASI 0x4E, VA==0 Name: ASI_ECACHE_TAG_DATA
FIGURE A-22 E-cache Tag Access Data Format
Fields, from most significant to least significant: - (unused), EC_parity, EC_state, EC_tag (ending at bit 0).

EC_tag: 14-bit physical tag field
• EC_tag==00, PA of associated data. Note EC_tag always read as 0’s. (The actual SRAM contents are returned, but UltraSPARC-IIi always forces 0’s on all tag writes)
EC_state: 2-bit E-cache state field. Encodings are:
• EC_state == 00 Invalid
• EC_state == 01 Not Used
• EC_state == 10 Exclusive
• EC_state == 11 Modified
EC_parity: 2-bit E-cache tag (odd) parity field
• EC_parity bit: Parity of EC_state. Tag parity on normal operation is computed using the actual PA. If that PA ==01 or 10 (greater than the supported DRAM) a tag parity error is created.
• EC_parity bit: Parity of EC_tag
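The state encodings lend themselves to a small decoder. The Python sketch below is hypothetical and assumes the three fields are packed contiguously from bit 0 in the order given in FIGURE A-22, that is, EC_tag in bits 13:0, EC_state in bits 15:14 and EC_parity in bits 17:16. These positions follow from the stated field widths (14, 2 and 2 bits) and should be verified before use.

```python
EC_STATES = {0b00: "Invalid", 0b01: "Not Used", 0b10: "Exclusive", 0b11: "Modified"}


def decode_ecache_tag_data(data: int) -> dict:
    """Split a value read with ASI_ECACHE_TAG_DATA into its fields.

    Assumed layout: EC_tag[13:0], EC_state[15:14], EC_parity[17:16].
    """
    return {
        "parity": (data >> 16) & 0x3,
        "state": EC_STATES[(data >> 14) & 0x3],
        "tag": data & 0x3FFF,
    }
```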
A.10
Memory Probing and Initialization
A.10.1
Initialization
The following steps must be performed before any access can be made to memory.
1. Determine the operating frequency of the system, then initialize the Mem_Control1 register with the appropriate values for the given operating frequency. See Section 18.3, “Mem_Control1 Register (0x1FE.0000.F018)” on page 282.
2. Enable refresh by setting the RefEnable bit in the Mem_Control0 register. See Section 18.2, “Mem_Control0 Register (0x1FE.0000.F010)” on page 279. This action supplies the DRAMs with their required minimum of eight RAS cycles to initialize their internal circuitry before they can be accessed.
RefInterval should be set to a value assuming a full memory system (see RefInterval table). Also, the DIMMPairPresent bits should all be set to 1. After the probing step, RefInterval and DIMMPairPresent can be set to the proper values (RefEnable must first be turned off). After setting RefEnable, wait at least
(8 DIMMs)*(8 refreshes)*(RefInterval)*(32 clocks)*(clock period) seconds
before beginning the probing step.
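The wait time above is a straightforward product. A minimal Python sketch, in which the function name and the example RefInterval value are hypothetical (the real RefInterval comes from the RefInterval table for the operating frequency):

```python
def refresh_wait_seconds(ref_interval: int, clock_hz: float) -> float:
    """Minimum wait after setting RefEnable, before probing begins.

    (8 DIMMs) * (8 refreshes) * RefInterval * (32 clocks) * (clock period)
    """
    clock_period = 1.0 / clock_hz
    return 8 * 8 * ref_interval * 32 * clock_period
```

For example, with an assumed RefInterval of 10 and a 100 MHz clock, the wait is 8 × 8 × 10 × 32 × 10 ns, or about 205 µs.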
A.10.2
Memory Probing
The only way to determine the number and size of DIMMs in the system is by probing. That is, writing to certain memory locations, and reading back to determine the effects of those writes. This section describes an algorithm for DIMM probing that is based upon the behavior of the hardware and the supported DIMM configurations. The algorithm employs the fact that writes to non-existent addresses can “wrap around” and overwrite data in a valid location (assuming that a DIMM is present). The algorithm described in the following sections specifies these addresses. The data pattern that is written to each location should contain a unique bit-signature, rather than consisting of all 0’s or all 1’s. All addresses for block write/read within a DIMM slot are specified below as PA[26:0]. PA[29:27] are varied for probing different DIMM slots/banks.
Perform the two steps below for PA[29:27] = 000, 001, 010, 011, in 10-bit column address mode. This covers a single bank in all four DIMM-pair slots/banks.
A.10.3
Detection of DIMM presence
To check whether a DIMM-pair is present or not, perform a write to a block of memory beginning at 0x000_0000, then read back from this location. If incorrect data is returned and/or an ECC error is generated, then there is no DIMM-pair at this location. Skip to the next DIMM-pair. The data pattern written to each location should contain a unique bit-signature, rather than consisting of all 0s or all 1s.
A.10.4
Determination of DIMM pair Size
To determine the base size of the existing DIMMs, write to 0x100_0000, then read from 0x000_0000. If the read does not return the data initially written to 0x000_0000, the DIMM size is 8 MB. This is because an 8 MB DIMM only has 24 address bits, so the write to 0x100_0000 wrapped and overwrote the contents of 0x000_0000.
If the correct data is returned, write to 0x200_0000, then read from 0x000_0000. If the read does not return the data written to 0x000_0000, the DIMM is of 16 MB capacity. This is because a 16 MB DIMM only has 25 valid address bits, so the write to 0x200_0000 wrapped and overwrote the contents of 0x000_0000.
If the correct data is returned, write to 0x400_0000 and read back from 0x000_0000. If the read does not return the data originally written into 0x000_0000, this indicates a 32 MB DIMM. The 32 MB DIMM has 26 valid address bits, so the write to 0x400_0000 wrapped and overwrote the contents of 0x000_0000. If the correct data is returned in 10-bit column address mode, this indicates a 64 MB DIMM, the largest possible using 10-bit column address mode.
If in 11-bit column address mode, and the correct data is returned, write to 0x800_0000. Read back from 0x000_0000. If the read fails to return the data originally written into 0x000_0000, this indicates a 64 MB DIMM. A 64 MB DIMM has 27 bits of valid address, so the write to 0x800_0000 wrapped around and overwrote the contents of 0x000_0000. Return of correct data indicates a 128 MB DIMM, the largest possible in 11-bit column address mode.
Repeat with PA[29]==1 to check for a second bank on each DIMM.
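The sizing sequence can be exercised against a toy memory model. The Python sketch below is illustrative only: FakeDimm is a hypothetical test double whose addresses wrap modulo 2^addr_bits, and probe_dimm_size_mb follows the write and read-back order described above. Real boot code would use block writes and reads and would also handle ECC errors.

```python
class FakeDimm:
    """Toy memory whose addresses wrap modulo 2**addr_bits (hypothetical test double)."""

    def __init__(self, addr_bits: int):
        self.mask = (1 << addr_bits) - 1
        self.cells = {}

    def write(self, pa: int, value: int) -> None:
        self.cells[pa & self.mask] = value

    def read(self, pa: int):
        return self.cells.get(pa & self.mask)


def probe_dimm_size_mb(mem, eleven_bit_mode: bool = False) -> int:
    """Return the DIMM size in MB by detecting address wrap-around."""
    signature = 0x5AC3_0F96_A55A_3CC3  # unique pattern, not all 0s or all 1s
    mem.write(0x000_0000, signature)
    steps = [(0x100_0000, 8), (0x200_0000, 16), (0x400_0000, 32)]
    if eleven_bit_mode:
        steps.append((0x800_0000, 64))
    for i, (probe_pa, size_mb) in enumerate(steps, start=1):
        mem.write(probe_pa, signature ^ i)  # distinct pattern for each probe
        if mem.read(0x000_0000) != signature:
            return size_mb  # the probe write wrapped onto address 0
    return 128 if eleven_bit_mode else 64
```

Writing a different pattern at each probe address keeps a wrapped write distinguishable from the original signature at address 0.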
A.10.5
Determination of DIMM pair size equivalence
For each DIMM pair, the above process should be repeated with PA[4]==1. The size of the other DIMM in the pair should be the same. If not, the smaller result must be used.
A.10.6
11-bit Column Address Mode
The DIMMs may have 11-bit column addresses, in which case they may be twice as large as previously indicated. 11-bit column addresses are supported with a mode bit in the Mem_Control0 CSR. It should only be enabled if all DIMMs have 11-bit column addresses. Only DIMM pairs 0 and 2 are used in 11-bit column address mode. After determining which DIMMs are present, the boot PROM should determine if DIMM pairs 0 and 2 have 11-bit column addresses, and, if so, enable that mode. Since column address bit [10] is always PA[14], 11-bit column addresses can be detected by the same algorithm used above to detect DIMM presence. Instead of toggling high order PA bits, PA[14] is toggled while all other bits are kept constant (the PA to use depends on the DIMM pair being tested). If toggling PA[14] causes an overwrite while the 11-bit column address mode is enabled, then the DRAMs in that DIMM should be assumed to be 10-bit column address DIMMs, and the mode not used. Ideally, the PA[14] test should be used on every DIMM (2 in each pair) by toggling PA[4] also, to guarantee that matching DIMMs have been inserted before 11-bit column address mode is allowed. If enabled, the sizes of DIMM pairs 0 and 2 are doubled if they exist, and pairs 1 and 3 should be ignored because they should not exist.
A.10.7
Banked DIMMs
The probing algorithm should also toggle PA[29] to determine if banked DIMMs are present, as above.
A.10.8
Completion of probing
Write RefInterval and DIMMPairPresent with the appropriate values after the probing is finished. Once probing is complete, the physical memory space available in the machine is known. The boot processor can then initialize data and ECC in the entire memory space with known values using block writes. After this step is performed, the memory system is ready for operation.
APPENDIX
B
Performance Instrumentation
B.1
Overview
Two performance events can be measured simultaneously in UltraSPARC-IIi. The Performance Control Register (PCR) controls event selection and filtering (that is, counting user and/or system level events) for a pair of 32-bit Performance Instrumentation Counters (PICs).
B.2
Performance Control and Counters
The 64-bit PCR and PIC are accessed through read/write Ancillary State Register instructions (RDASR/WRASR). PCR and PIC are located at ASRs 16 (0x10) and 17 (0x11) respectively. Access to the PCR is privileged. Non-privileged accesses cause a privileged_opcode trap. Non-privileged access to PICs may be restricted by setting the PCR.PRIV field while in privileged mode. When PCR.PRIV=1, an attempt by non-privileged software to access the PICs causes a privileged_action trap. Event measurements in non-privileged and/or privileged modes can be controlled by setting the PCR.UT and PCR.ST fields. Two 32-bit PICs each accumulate over 4 billion events before wrapping around. There is no special handling or notification when the counters wrap. Extended event logging may be accomplished by periodically reading the contents of the PICs before each overflows. Additional statistics can be collected using the two PICs over multiple passes of program execution.
Two events can be measured simultaneously by setting the PCR.select fields together with the PCR.UT and PCR.ST fields. The selected statistics are reflected during subsequent accesses to the PICs. The difference between the values read from the PIC on two subsequent reads reflects the number of events that occurred between them. Software may only rely on read-to-read counts of the PIC for accurate timing and not on write-to-read counts. See also Table 17-3, “Machine State After Reset and in RED_state,” on page 272 for the state of these registers after reset.
Bits    Field
63:15   -
14:11   S1
10:8    -
7:4     S0
3       -
2       UT
1       ST
0       PRIV
FIGURE B-1 Performance Control Register (PCR)

S1|S0: Two four-bit fields; each selects a performance instrumentation event from the list in Section B.4.5, “PCR.S0 and PCR.S1 Encoding” on page 407. The event selected by S0 is counted in PIC.D0; the event selected by S1 is counted in PIC.D1.
UT: User_trace; if set, events in non-privileged (user) mode are counted. This may be set along with PCR.ST to count all selected events.
ST: System_trace; if set, events in privileged (system) mode are counted. This may be set along with PCR.UT to count all selected events.
PRIV: Privileged; if set, non-privileged access to the PIC will cause a privileged_action trap.

Bits    Field
63:32   D1
31:0    D0
FIGURE B-2 Performance Instrumentation Counters (PIC)

D1|D0: A pair of 32-bit counters; D0 counts the events selected by PCR.S0; D1 counts the events selected by PCR.S1.
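Because the counters wrap silently, read-to-read differences are best computed modulo 2^32. The following Python sketch shows one way to post-process two 64-bit PIC samples. The function names are hypothetical, and the result is only meaningful if each counter wrapped at most once between the two reads.

```python
MASK32 = 0xFFFF_FFFF


def split_pic(pic: int):
    """Return (D0, D1) from a 64-bit PIC value: D0 is bits 31:0, D1 is bits 63:32."""
    return pic & MASK32, (pic >> 32) & MASK32


def pic_deltas(earlier: int, later: int):
    """Events counted between two PIC reads, tolerating one wrap per counter."""
    d0_a, d1_a = split_pic(earlier)
    d0_b, d1_b = split_pic(later)
    return (d0_b - d0_a) & MASK32, (d1_b - d1_a) & MASK32
```

With PCR.S0 selecting Cycle_cnt and PCR.S1 selecting Instr_cnt, dividing the D1 delta by the D0 delta gives the average number of instructions completed per cycle over the interval.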
B.3
PCR/PIC Accesses
An example of the operational flow in using the performance instrumentation is shown in FIGURE B-3.
FIGURE B-3 PCR/PIC Operational Flow
Context A: start; set up PCR (sel → PCR.sel, [0,1] → PCR.UT/ST, [0,1] → PCR.PRIV); accumulate statistics in PIC, reading them as needed (PIC[PCR.sel] → Rd).
Context switch to B: save the state of context A (PCR → [saveA1], PIC → [saveA2]); accumulate statistics in PIC for context B (PIC[PCR.sel] → Rd).
Context switch back to A: restore the saved state ([saveA1] → PCR, [saveA2] → PIC); accumulate statistics in PIC (PIC[PCR.sel] → Rd); end.
B.4
Performance Instrumentation Counter Events
B.4.1
Instruction Execution Rates
Cycle_cnt [PIC0,PIC1]: accumulated cycles; this counter is similar to the SPARC-V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields.
Instr_cnt [PIC0,PIC1]: the number of instructions completed; annulled, mispredicted or trapped instructions are not counted.
Using the two counters to measure instruction completion and cycles allows calculation of the average number of instructions completed per cycle.
B.4.2
Grouping (G) Stage Stall Counts
These are the major causes of pipeline stalls (bubbles) from the G Stage of the pipeline. Stalls are counted for each clock for which the associated condition is true.
Dispatch0_IC_miss [PIC0]: I-buffer is empty from I-cache miss. This includes E-cache miss processing if an E-cache miss also occurs.
Dispatch0_mispred [PIC1]: I-buffer is empty from Branch misprediction. Branch misprediction kills instructions after the dispatch point, so the total number of pipeline bubbles is approximately twice as big as measured from this count.
Dispatch0_storeBuf [PIC0]: Store buffer cannot hold additional stores, and a store instruction is the first instruction in the group.
Dispatch0_FP_use [PIC1]: The first instruction in the group depends on an earlier floating point result that is not yet available, but only while the earlier instruction is not stalled for a Load_use (see B.4.3). Thus, Dispatch0_FP_use and Load_use are mutually exclusive counts.
Some less common stalls (see Chapter 22, “Grouping Rules and Stalls”) are not counted by any performance counter. This situation includes one cycle stalls for an FGA/FGM instruction entering the G stage following an FDIV or FSQRT.
B.4.3
Load Use Stall Counts
Stalls are counted for each clock for which the associated condition is true.
Load_use [PIC0]: An instruction in the execute stage depends on an earlier load result that is not yet available. This stalls all instructions in the execute and grouping stages. Load_use also counts cycles when no instructions are dispatched due to a one cycle load-load dependency on the first instruction presented to the grouping logic. There are also overcounts due to, for example, mispredicted CTIs and dispatched instructions that are invalidated by traps.
Load_use_RAW [PIC1]: There is a load use in the execute stage and there is a read-after-write hazard on the oldest outstanding load. This indicates that load data is being delayed by completion of an earlier store.
Some less common stalls (see Chapter 22, “Grouping Rules and Stalls”) are not counted by any performance counter, including:
• Stalls associated with WRPR/RDPR and internal ASI loads
• MEMBAR stalls
• One cycle stalls due to bad prediction around a change to the Current Window Pointer (CWP)
B.4.4
Cache Access Statistics
I-, D-, and E-cache access statistics can be collected. Counts are updated by each cache access, regardless of whether the access will be used.
IC_ref [PIC0]: I-cache references; I-cache references are fetches of up to four instructions from an aligned block of eight instructions. I-cache references are generally prefetches and do not correspond exactly to the instructions executed.
IC_hit [PIC1]: I-cache hits
DC_rd [PIC0]: D-cache read references (including accesses that subsequently trap); non D-cacheable accesses are not counted. Atomic instructions, block loads, “internal” and “external” bad ASIs, quad precision LDD, and MEMBARs also fall into this class.
DC_rd_hit [PIC1]: D-cache read hits are counted in one of two places:
• When they access the D-cache tags and do not enter the load buffer (because it is already empty)
• When they exit the load buffer (due to a D-cache miss or a non-empty load buffer)
Loads that hit the D-cache may be placed in the load buffer for a number of reasons (because of a non-empty load buffer, for example). Such loads may be turned into misses if a snoop occurs during their stay in the load buffer (due to an external request or to an E-cache miss). In this case they do not count as D-cache read hits. See Section 21.3, “Data Stream Issues” on page 350.
DC_wr [PIC0]: D-cache write references (including accesses that subsequently trap); non D-cacheable accesses are not counted.
DC_wr_hit [PIC1]: D-cache write hits
EC_ref [PIC0]: total E-cache references; non-cacheable accesses are not counted.
EC_hit [PIC1]: total E-cache hits.
EC_write_hit_RDO [PIC0]: E-cache hits that do a read for ownership of a UPA transaction.
EC_wb [PIC1]: E-cache misses that do writebacks
EC_snoop_inv [PIC0]: E-cache invalidates from the following UPA transactions: S_INV_REQ, S_CPI_REQ
EC_snoop_cb [PIC1]: E-cache snoop copy-backs from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ
EC_rd_hit [PIC0]: E-cache read hits from D-cache misses
EC_ic_hit [PIC1]: E-cache read hits from I-cache misses
The E-cache write hit count is determined by subtracting the read hit and the instruction hit count from the total E-cache hit count. The E-cache write reference count is determined by subtracting the D-cache read miss (D-cache read references minus D-cache read hits) and I-cache misses (I-cache references minus I-cache hits) from the total E-cache references. Because of store buffer compression, this value is not the same as D-cache write misses.
Note – A block memory access is counted as a single reference. Atomics count the read and write individually.
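The two derived counts described above are simple arithmetic on the raw event totals. A minimal Python sketch with hypothetical function names follows. Since only two events can be counted at a time, the inputs would have to be gathered over several passes of the same workload.

```python
def ecache_write_hits(ec_hit: int, ec_rd_hit: int, ec_ic_hit: int) -> int:
    """E-cache write hits = total hits - read hits - instruction hits."""
    return ec_hit - ec_rd_hit - ec_ic_hit


def ecache_write_refs(ec_ref: int, dc_rd: int, dc_rd_hit: int,
                      ic_ref: int, ic_hit: int) -> int:
    """E-cache write references = total refs - D-cache read misses - I-cache misses."""
    dc_read_misses = dc_rd - dc_rd_hit
    ic_misses = ic_ref - ic_hit
    return ec_ref - dc_read_misses - ic_misses
```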
B.4.5
PCR.S0 and PCR.S1 Encoding
TABLE B-1 PIC.S0 Selection Bit Field Encoding

S0 Value    PIC0 Selection
0000        Cycle_cnt
0001        Instr_cnt
0010        Dispatch0_IC_miss
0011        Dispatch0_storeBuf
1000        IC_ref
1001        DC_rd
1010        DC_wr
1011        Load_use
1100        EC_ref
1101        EC_write_hit_RDO
1110        EC_snoop_inv
1111        EC_rd_hit

TABLE B-2 PIC.S1 Selection Bit Field Encoding

S1 Value    PIC1 Selection
0000        Cycle_cnt
0001        Instr_cnt
0010        Dispatch0_mispred
0011        Dispatch0_FP_use
1000        IC_hit
1001        DC_rd_hit
1010        DC_wr_hit
1011        Load_use_RAW
1100        EC_hit
1101        EC_wb
1110        EC_snoop_cb
1111        EC_ic_hit
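The encodings in TABLE B-1 and TABLE B-2 can be combined with the PCR field positions from FIGURE B-1 (S1 in bits 14:11, S0 in bits 7:4, UT in bit 2, ST in bit 1, PRIV in bit 0) to compute a PCR value. The Python helper below is a sketch with hypothetical names. It only computes the 64-bit value that privileged software would then write to the PCR.

```python
S0_EVENTS = {
    "Cycle_cnt": 0b0000, "Instr_cnt": 0b0001, "Dispatch0_IC_miss": 0b0010,
    "Dispatch0_storeBuf": 0b0011, "IC_ref": 0b1000, "DC_rd": 0b1001,
    "DC_wr": 0b1010, "Load_use": 0b1011, "EC_ref": 0b1100,
    "EC_write_hit_RDO": 0b1101, "EC_snoop_inv": 0b1110, "EC_rd_hit": 0b1111,
}
S1_EVENTS = {
    "Cycle_cnt": 0b0000, "Instr_cnt": 0b0001, "Dispatch0_mispred": 0b0010,
    "Dispatch0_FP_use": 0b0011, "IC_hit": 0b1000, "DC_rd_hit": 0b1001,
    "DC_wr_hit": 0b1010, "Load_use_RAW": 0b1011, "EC_hit": 0b1100,
    "EC_wb": 0b1101, "EC_snoop_cb": 0b1110, "EC_ic_hit": 0b1111,
}


def pcr_value(s0_event: str, s1_event: str, user: bool = True,
              system: bool = False, priv: bool = False) -> int:
    """Assemble a PCR value from event names and the UT/ST/PRIV flags."""
    return ((S1_EVENTS[s1_event] << 11) | (S0_EVENTS[s0_event] << 4)
            | (int(user) << 2) | (int(system) << 1) | int(priv))
```

Keeping separate dictionaries for S0 and S1 makes an attempt to select, say, IC_hit on PIC0 fail with a KeyError instead of silently counting a different event.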
APPENDIX
C
IEEE 1149.1 Scan Interface
C.1
Introduction
UltraSPARC-IIi provides an IEEE Std 1149.1-1990-compliant test access port (TAP) and boundary scan architecture. The primary use of 1149.1 scan interface is for board-level interconnect testing and diagnosis. The IEEE 1149.1 test access port and boundary scan architecture consists of three major parts:
• Test access port controller
• Instruction register
• Test data registers (numerous; public and private)
For information about how to obtain a copy of IEEE Std 1149.1-1990, see “Bibliography.”
C.2
Interface
The IEEE Std 1149.1-1990 serial scan interface is composed of a set of pins and a TAP controller state machine that responds to the pins. The five-wire IEEE 1149.1 interface is used in UltraSPARC-IIi. TABLE C-1 describes the five pins.

TABLE C-1 IEEE 1149.1 Signals

Signal    I/O    Description
TDO       O      Test data out. This is the scan shift output signal from either the instruction register or one of the test data registers.
TDI       I      Test data input. This forms the scan shift in signal for the instruction and various test data registers.
TMS       I      This signal is used to sequence the TAP state machine through the appropriate sequences. Holding this signal high for at least five clock cycles will force the TAP to the TEST-LOGIC-RESET state.
TCK       I      Test clock. The inputs TDI and TMS are sampled on the rising edge of TCK and the TDO output becomes valid after the falling edge of TCK.
TRST_L    I      The IEEE 1149.1 logic is asynchronously reset when TRST_L goes low.
C.3
Test Access Port Controller
The Test Access Port (TAP) controller is a 16-state synchronous finite state machine. Transitions between states occur only at the rising edge of TCK in response to the TMS signal, or when TRST_L is asserted.

TABLE C-2 TAP Controller State Diagram
States: TEST-LOGIC-RESET, RUN-TEST/IDLE, SELECT-DR-SCAN, CAPTURE-DR, SHIFT-DR, EXIT-1-DR, PAUSE-DR, EXIT-2-DR, UPDATE-DR, SELECT-IR-SCAN, CAPTURE-IR, SHIFT-IR, EXIT-1-IR, PAUSE-IR, EXIT-2-IR, UPDATE-IR. Each transition is labeled with the TMS value (0 or 1) that selects it.
TABLE C-2 shows the state machine diagram. The values shown adjacent to state transitions represent the value of TMS required at the time of a rising edge of TCK for the transition to occur. Note that the IR states select the instruction register and DR states refer to states that may select a test data register, depending on the active instruction.
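For readers who want to trace TMS sequences by hand, the state machine can be written as a lookup table. The Python sketch below uses the standard IEEE 1149.1 TAP transitions, with state names spelled as in this appendix. The table and the tap_walk helper are illustrative and are not taken from the UltraSPARC-IIi design data.

```python
# state -> (next state when TMS=0, next state when TMS=1), sampled on rising TCK
TAP_NEXT = {
    "TEST-LOGIC-RESET": ("RUN-TEST/IDLE", "TEST-LOGIC-RESET"),
    "RUN-TEST/IDLE": ("RUN-TEST/IDLE", "SELECT-DR-SCAN"),
    "SELECT-DR-SCAN": ("CAPTURE-DR", "SELECT-IR-SCAN"),
    "CAPTURE-DR": ("SHIFT-DR", "EXIT-1-DR"),
    "SHIFT-DR": ("SHIFT-DR", "EXIT-1-DR"),
    "EXIT-1-DR": ("PAUSE-DR", "UPDATE-DR"),
    "PAUSE-DR": ("PAUSE-DR", "EXIT-2-DR"),
    "EXIT-2-DR": ("SHIFT-DR", "UPDATE-DR"),
    "UPDATE-DR": ("RUN-TEST/IDLE", "SELECT-DR-SCAN"),
    "SELECT-IR-SCAN": ("CAPTURE-IR", "TEST-LOGIC-RESET"),
    "CAPTURE-IR": ("SHIFT-IR", "EXIT-1-IR"),
    "SHIFT-IR": ("SHIFT-IR", "EXIT-1-IR"),
    "EXIT-1-IR": ("PAUSE-IR", "UPDATE-IR"),
    "PAUSE-IR": ("PAUSE-IR", "EXIT-2-IR"),
    "EXIT-2-IR": ("SHIFT-IR", "UPDATE-IR"),
    "UPDATE-IR": ("RUN-TEST/IDLE", "SELECT-DR-SCAN"),
}


def tap_walk(state: str, tms_bits) -> str:
    """Apply a sequence of TMS values (one per rising TCK edge) and return the final state."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state
```

Walking five consecutive 1s from any entry in the table ends in TEST-LOGIC-RESET, which is the property that makes holding TMS high a reliable synchronous reset.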
C.3.1
TEST-LOGIC-RESET
The TAP controller enters the TEST-LOGIC-RESET state when the TRST_L pin is asserted or when the TMS signal is held high for at least five clock cycles, regardless of the original state of the controller. It remains in this state while TMS is held high. In this state the test logic is disabled and the instruction register is initialized to select the Device ID register.
C.3.2
RUN-TEST/IDLE
RUN-TEST/IDLE is an intermediate controller state between scan operations. If no instruction is selected, all test data registers retain their current states. Once the state machine enters this state, it remains there for as long as TMS is held low.
C.3.3
SELECT-DR-SCAN
SELECT-DR-SCAN is a temporary state in which all test data registers retain their previous states.
C.3.4
SELECT-IR-SCAN
SELECT-IR-SCAN is another temporary state in which all test data registers retain their previous states.
C.3.5
CAPTURE IR/DR
In this state, the selected register, which can be either the instruction register or a data register, loads data from its parallel input. For the instruction register, this corresponds to sampling the eight bits of status information and loading the constant ‘01’ pattern into the two least significant bit locations.
C.3.6
SHIFT IR/DR
In this state, the IR/DR shift towards their serial output during each rising edge of TCK.
C.3.7
EXIT-1 IR/DR
This state is a temporary controller state in which the IR/DR retain their previous states.
C.3.8
PAUSE IR/DR
This state is a temporary controller state in which the IR/DR retain their previous states. It is provided to temporarily halt data-shifting through the instruction register or the test data register—without having to stop TCK.
C.3.9
EXIT-2 IR/DR
This state is a temporary controller state in which the IR/DR retain their previous states.
C.3.10
UPDATE IR/DR
Data is latched on to the parallel output of the IR/DR from the shift-register path during this controller state. The data held at the previous outputs of the instruction register or test data register only changes in this controller state.
C.4
Instruction Register
The instruction register is used to select the test to be performed and the test data register to be accessed. This register is 8 bits wide and consists of a serial-input/serial-output shift register that has parallel inputs and a parallel output stage. The parallel outputs are loaded during the UPDATE-IR state with the instruction shifted into the shift register stage. This method ensures that the instruction changes only synchronously, either at the end of an instruction register shift or on entry to the TEST-LOGIC-RESET state. The behavior of the instruction register in each controller state is shown in TABLE C-3.
TABLE C-3   Instruction Register Behavior

Controller State    Shift Register                Parallel Output
TEST-LOGIC-RESET    Undefined                     Set to 00₁₆ (select Device ID register for shift)
CAPTURE IR          Load 01 into IR               Retain last state
SHIFT IR            Shift towards serial output   Retain last state
UPDATE IR           Retain last state             Load from shift-register stage
All other states    Retain last state             Retain last state
At the start of an instruction register shift, that is, during the CAPTURE-IR state, a constant ‘01’ pattern loads into the least-significant two bits to aid fault isolation in the board-level serial test data path.
C.5
Instructions
The UltraSPARC-IIi 8-bit instruction register (IR) implements public and private instructions. Out of the 256 possible encodings, 75 are valid instructions. All invalid encodings default to the BYPASS instruction as defined in IEEE Std 1149.1-1990. The public instructions implemented are: BYPASS, IDCODE, EXTEST, SAMPLE and INTEST. Private instructions are used in manufacturing and should not be used before consulting your SPARC sales representative. The instruction encodings and the test data registers they select are presented in TABLE C-4.
TABLE C-4   IEEE 1149.1 Instruction Encodings

Instruction   IR encoding    Scan Chain
BYPASS        FF₁₆           bypass
IDCODE        FE₁₆           id register
EXTEST        00₁₆           boundary
SAMPLE        07₁₆           boundary
INTEST        01₁₆           boundary
PLLMODE       9F₁₆           pll mode
CLKCTRL       9D₁₆           clock control
RAMWCP        BD₁₆           ram control
POWERCUT      8E₁₆           N/A
HIGHZ         FD₁₆           bypass
INTEST2       8F₁₆           boundary
FULLSCAN      40₁₆..7F₁₆     internal
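The count of 75 valid encodings follows directly from the table: eleven single encodings plus the 64 FULLSCAN encodings. The sketch below, assuming only the encodings listed in TABLE C-4, shows a software decoder that falls back to BYPASS for every other value; the names SINGLE_ENCODINGS, is_valid, and decode are hypothetical.

```python
# Single-value encodings copied from TABLE C-4.
SINGLE_ENCODINGS = {
    0xFF: "BYPASS", 0xFE: "IDCODE", 0x00: "EXTEST", 0x07: "SAMPLE",
    0x01: "INTEST", 0x9F: "PLLMODE", 0x9D: "CLKCTRL", 0xBD: "RAMWCP",
    0x8E: "POWERCUT", 0xFD: "HIGHZ", 0x8F: "INTEST2",
}
# FULLSCAN occupies the whole range 0x40..0x7F.
FULLSCAN_RANGE = range(0x40, 0x80)
PUBLIC = {"BYPASS", "IDCODE", "EXTEST", "SAMPLE", "INTEST"}


def is_valid(ir):
    """True if the 8-bit IR value is one of the listed encodings."""
    return ir in SINGLE_ENCODINGS or ir in FULLSCAN_RANGE


def decode(ir):
    """Map an 8-bit IR value to an instruction name; invalid values act as BYPASS."""
    ir &= 0xFF
    if ir in FULLSCAN_RANGE:
        return "FULLSCAN"
    return SINGLE_ENCODINGS.get(ir, "BYPASS")


valid_count = sum(1 for ir in range(256) if is_valid(ir))
```

A test harness built this way can refuse anything outside PUBLIC, which matches the guidance that private instructions are for manufacturing use only.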
C.5.1
Public Instructions
C.5.1.1
BYPASS
The BYPASS instruction selects the BYPASS register as the active test data register.
C.5.1.2
SAMPLE/PRELOAD
SAMPLE/PRELOAD selects the boundary scan register as the active test data register. Without disturbing normal processor operation, this instruction enables the I/O pin states to be observed or a value to be shifted into the boundary scan chain.
C.5.1.3
EXTEST
EXTEST selects the boundary scan register to be the active test data register and is used to perform board level interconnect testing. In this condition the boundary scan chain drives the processor pins and UltraSPARC-IIi cannot function normally.
C.5.1.4
INTEST
This instruction selects the boundary scan register to be the active test data register, allowing it to be used as a virtual low-speed functional tester. The on-chip clock is derived from TCK and is issued in the RUN-TEST/IDLE state of the TAP controller.
C.5.1.5
IDCODE
IDCODE selects the ID register for shifting.
C.5.2
Private Instructions
The private instructions (PLLMODE, CLKCTRL, RAMWCP, POWERCUT, HIGHZ, INTEST2, and all versions of FULLSCAN) should not be used before consulting your SPARC sales representative. Improper use of any private instruction can permanently damage UltraSPARC-IIi and render it inoperative.
C.6
Public Test Data Registers
C.6.1
Device ID Register
The 32-bit Device ID register is loaded with the UltraSPARC-IIi ID upon entering the CAPTURE-DR TAP state when the IDCODE instruction is active, or during the TEST-LOGIC-RESET state. FIGURE C-1 shows the structure of the Device ID Register.
Bits 31:28   Version
Bits 27:12   0100 0110 0110 1000
Bits 11:1    000 0100 0101
Bit 0        1
FIGURE C-1   Device ID Register
The device ID is loaded into the register on the rising edge of TCK in the CAPTURE-DR state. The value of ID is fixed at 4668045F₁₆ and the version number, ID, changes as specified in IEEE Std 1149.1-1990.
C.6.2
Bypass Register
This register provides a single bit delay between TDI and TDO. During the CAPTURE-DR controller state, and if it is selected by the current instruction, the bypass register loads a logical zero.
C.6.3
Boundary Scan Register
The Boundary Scan Register allows for testing circuitry external to the device; for example:
• testing the interconnect by setting defined values at the device periphery – using the EXTEST instruction
• sampling and examination of pin states without disturbing the system – using the SAMPLE/PRELOAD instruction
• testing device function itself – using the INTEST instruction
The boundary scan register for UltraSPARC-IIi is 770 bits long. The mapping between register bits and the pin signals is described in a Boundary Scan Description Language (BSDL) file available from your SPARC sales representative.
Note – It is recommended that transitions from the Capture-DR TAP controller state to the Shift-DR controller state progress through the Exit1-DR, Pause-DR, and Exit2-DR states. A direct progression from Capture-DR to Shift-DR is not recommended when the boundary scan register is selected.
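The recommended path corresponds to a specific TMS sequence. The following sketch assumes the standard IEEE 1149.1 transitions for the four DR states involved; DR_NEXT, RECOMMENDED_TMS, and trace are illustrative names, not part of any supplied software.

```python
# Standard IEEE 1149.1 transitions for the DR states on this path.
# Each state maps to (next state when TMS = 0, next state when TMS = 1).
DR_NEXT = {
    "Capture-DR": ("Shift-DR", "Exit1-DR"),
    "Exit1-DR": ("Pause-DR", "Update-DR"),
    "Pause-DR": ("Pause-DR", "Exit2-DR"),
    "Exit2-DR": ("Shift-DR", "Update-DR"),
}

# TMS values, one per rising edge of TCK, that take the indirect route.
RECOMMENDED_TMS = [1, 0, 1, 0]


def trace(state, tms_bits):
    """Return the list of states visited for a sequence of TMS values."""
    visited = []
    for tms in tms_bits:
        state = DR_NEXT[state][tms]
        visited.append(state)
    return visited


path = trace("Capture-DR", RECOMMENDED_TMS)
```

Holding TMS low in Capture-DR would instead enter Shift-DR directly, which is the progression the note advises against for the boundary scan register.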
C.6.4
Private Data Registers
Private data registers should not be accessed before consulting your SPARC sales representative.
APPENDIX
D
ECC Specification
D.1
ECC Code
The 64-bit ECC code specification can be found in Shigeo Kaneda’s correspondence note: “A Class of Odd-Weight-Column SEC-DED-SbED Codes for Memory System Applications”, IEEE Transactions on Computers, August 1984.
TABLE D-1 shows the syndrome table for the ECC code, followed by the Verilog code for error detection, correction, and syndrome generation.
TABLE D-1   Syndrome table for ECC SEC/S4ED code

SYND bits  SYND bits 7 6 5 4
0123       0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
0000       *    C4   C5   D    C6   D    D    T    C7   D    D    T    D    T    T    Q
0001       C0   D    D    00   D    25   M    D    D    05   17   D    08   D    D    12
0010       C1   D    D    01   D    29   36   D    D    M    21   D    13   D    D    09
0011       D    32   33   D    42   D    D    M    47   D    D    M    D    T    T    D
0100       C2   D    D    10   D    27   07   D    D    M    19   D    02   D    D    14
0101       D    57   61   D    59   Q    D    M    63   D    Q    M    D    M    M    D
0110       D    M    04   D    39   D    D    22   M    D    D    30   D    16   24   D
0111       T    D    D    M    D    M    54   D    D    50   M    D    T    D    D    M
1000       C3   D    D    15   D    31   M    D    D    38   23   D    03   D    D    11
1001       D    37   M    D    M    D    D    18   06   D    D    26   D    20   28   D
1010       D    49   53   D    51   Q    D    M    55   D    Q    M    D    M    M    D
1011       T    D    D    M    D    M    62   D    D    58   M    D    T    D    D    M
1100       D    40   45   D    34   D    D    T    35   D    D    T    D    M    M    D
1101       T    D    D    T    D    M    48   D    D    52   M    D    M    D    D    M
1110       T    D    D    T    D    M    56   D    D    60   M    D    M    D    D    M
1111       Q    44   41   D    46   D    D    M    43   D    D    M    D    M    M    Q
CODE EXAMPLE D-1 describes the check bit generation equations in the most concise way.

CODE EXAMPLE D-1   Description of ECC checkbit Generation Equations

    function [7:0] get_ecc8;
      input [63:0] data;
      begin
        get_ecc8[7:0] = {
          ^(64'h9494884855bb7b6c & data[63:0]),
          ^(64'h49494494bb557b8c & data[63:0]),
          ^(64'h6161221255eede93 & data[63:0]),
          ^(64'h16161161ee55de23 & data[63:0]),
          ^(64'h55bb7b6c94948848 & data[63:0]),
          ^(64'hbb557b8c49494494 & data[63:0]),
          ^(64'h55eede9361612212 & data[63:0]),
          ^(64'hee55de2316161161 & data[63:0])
        };
      end
    endfunction
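The same equations can be modeled in software to experiment with single-bit errors. The sketch below is a Python transcription of get_ecc8 under the assumption that the first mask in the Verilog concatenation produces check bit 7; the helper names (MASKS, SYNDROME_TO_BIT, correct_single_bit) are hypothetical, and the correction routine handles only the single-data-bit case, not the double, triple, or quad cases of the full code.

```python
# Masks in Verilog concatenation order: the first produces check bit 7.
MASKS = [
    0x9494884855BB7B6C, 0x49494494BB557B8C,
    0x6161221255EEDE93, 0x16161161EE55DE23,
    0x55BB7B6C94948848, 0xBB557B8C49494494,
    0x55EEDE9361612212, 0xEE55DE2316161161,
]


def get_ecc8(data):
    """Return the 8 check bits for a 64-bit data word."""
    ecc = 0
    for mask in MASKS:
        parity = bin(mask & data).count("1") & 1  # XOR reduction
        ecc = (ecc << 1) | parity
    return ecc


# Syndrome produced by flipping exactly one data bit, for each bit position.
SYNDROME_TO_BIT = {get_ecc8(1 << bit): bit for bit in range(64)}


def correct_single_bit(data, check):
    """Return (data, flipped_bit); flipped_bit is None when nothing is repaired."""
    syndrome = get_ecc8(data) ^ check
    bit = SYNDROME_TO_BIT.get(syndrome)
    if bit is None:
        return data, None  # no error, or an error this sketch does not handle
    return data ^ (1 << bit), bit


# Example: store a word, flip data bit 37 in transit, then repair it.
word = 0x0123456789ABCDEF
stored_check = get_ecc8(word)
repaired, flipped = correct_single_bit(word ^ (1 << 37), stored_check)
```

Because the code is linear, the syndrome of a single-bit error depends only on the flipped position, which is why a 64-entry lookup is enough for this case.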
APPENDIX
E
UPA64S interface
E.1
UPA64S Bus
The UPA64S bus transfers data in a packetized mode between UltraSPARC-IIi and system DRAM. In addition it is used to transfer data to a connected UPA64S device, for example, a Fast Frame Buffer (FFB).
E.1.1
Data Bus (MEMDATA)
MEMDATA is a 72-bit bidirectional bus between UltraSPARC-IIi and the memory transceivers. Bits[63:0] are also used to connect to a UPA64S device. The transaction set supports block transfers of 64 bytes; and quadword noncached transfers of 1 to 16 bytes, qualified with a 16-bit bytemask. Data transfers are 8 bytes per UPA clock cycle on MEMDATA[63:0].
FIGURE E-1 illustrates how data and ECC bytes are arranged and addressed within a doubleword.

Dword bits   63:56   55:48   47:40   39:32   31:24   23:16   15:8    7:0
Dword byte   Byte 0  Byte 1  Byte 2  Byte 3  Byte 4  Byte 5  Byte 6  Byte 7

FIGURE E-1   Data Byte Addresses Within a Dword
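The byte numbering in FIGURE E-1 places byte 0 in the most significant bits. A minimal sketch of that mapping follows; the function names byte_lane and extract_byte are illustrative only.

```python
def byte_lane(byte_index):
    """Return (msb, lsb) of the MEMDATA[63:0] bits that carry the given byte."""
    if not 0 <= byte_index <= 7:
        raise ValueError("byte index must be 0..7")
    msb = 63 - 8 * byte_index
    return msb, msb - 7


def extract_byte(dword, byte_index):
    """Return the byte at the given address offset within a 64-bit dword."""
    _, lsb = byte_lane(byte_index)
    return (dword >> lsb) & 0xFF


example = 0x0011223344556677
first_byte = extract_byte(example, 0)   # carried on bits 63:56
last_byte = extract_byte(example, 7)    # carried on bits 7:0
```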
E.1.2
SYSADDR Bus
UltraSPARC-IIi directly sends a request to the UPA64S slave, using SYSADDR and ADR_VLD, which are always driven.
E.2
UPA64S Transaction Overview
• P_REQ: transaction request from UltraSPARC-IIi to the UPA64S device on the SYSADDR bus; these transactions initiate activity.
• P_REPLY: generated by the UPA64S device in response to a previous P_REQ transaction; indicates read data available, or write data consumed.
• S_REPLY: driven by UltraSPARC-IIi; initiates transfer of data.
E.2.1
NonCachedRead (P_NCRD_REQ)
Noncached Read; generated by UltraSPARC-IIi for a load or instruction fetch to a noncached UPA64S address. 1, 2, 4, 8, or 16 bytes are read with this transaction, and the byte location is specified with a bytemask. The address is aligned on a 16-byte boundary. The bytemask is aligned on a natural boundary that matches the total data size. Only one P_NCRD_REQ may be outstanding to the UPA64S device at a time. The next P_NCRD_REQ request can be sent on the cycle after the P_RASB reply. Data is transferred with the S_SRS reply.
E.2.2
NonCachedBlockRead (P_NCBRD_REQ)
Noncached Block Read Request; 64 bytes of noncached data are read with this transaction, which UltraSPARC-IIi generates for a block read of a noncached UPA64S address space. Similar to P_NCRD_REQ except that there is no bytemask; the data is aligned on a 64-byte boundary (PA = 0₁₆). Data is delivered with the S_SRB reply.
E.2.3
NonCachedWrite (P_NCWR_REQ)
Noncached Write; generated by UltraSPARC-IIi to write to a noncached UPA64S address space. The address is aligned on a 16-byte boundary. Any number of bytes from 0 to 16 can be written, as specified by a 16-bit bytemask, to slave devices that support writes with arbitrary bytemasks (mainly graphics devices). A bytemask of all zeros indicates a no-op at the slave. S_SWS is used to transfer the data. When UltraSPARC-IIi drives the S_REPLY, it considers the transaction completed and decrements the count of outstanding requests for flow control.
E.2.4
NonCachedBlockWrite (P_NCBWR_REQ)
Noncached Block Write Request; 64 bytes of noncached data are written by UltraSPARC-IIi with this transaction, which is generated for a block store to a noncached UPA64S address. Similar to P_NCWR_REQ except that there is no bytemask; the data is aligned on a 64-byte boundary (PA = 0₁₆). Data is transferred with the S_SWB reply.
E.3
P_REPLY and S_REPLY
E.3.1
P_REPLY
The UPA64S device drives P_REPLY to UltraSPARC-IIi. All P_REPLYs are generated as an acknowledgment by the UPA64S device in response to a request previously sent by UltraSPARC-IIi.
TABLE E-1   P_REPLY Type Definitions

Type      Definition
P_IDLE    Idle. The default state of the wires when there is no reply to be given.
P_RASB    Read Ack Single and Block. 16 or 64 bytes of data are ready in its output data queue for the P_NCRD_REQ | P_NCBRD_REQ request sent to it, and there is room in its input request queue for another P_REQ. UltraSPARC-IIi knows, from programmable registers, the depth of the queues on the UPA64S device, and does not cause the queues to overflow or underflow.
P_WAS     Write Ack Single; reply to P_NCWR_REQ request for single writes. The UPA64S port acknowledges that the 16 bytes of data placed in its input data queue have been absorbed, that there is room for writing another 16 bytes of data into the input data queue, and that there is room in its input request queue for another slave P_REQ for data.
P_WAB     Write Ack Block; reply to P_NCBWR_REQ for block write. The UPA64S slave port acknowledges that the 64 bytes of data placed in its input data queue have been absorbed, that there is room for writing another 64 bytes of data into the input data queue, and that there is room in its input request queue for another slave P_REQ for data.
TABLE E-2 shows the encodings for the transactions defined in TABLE E-1.
TABLE E-2   P_REPLY Encoding

P_REPLY   Name                     Reply to Transaction        Encoding
P_IDLE    Idle                     Default State               00
P_WAB     Write ACK Block          P_NCBWR_REQ                 01
P_WAS     Write ACK Single         P_NCWR_REQ                  10
P_RASB    Read ACK Single/Block    P_NCRD_REQ, P_NCBRD_REQ     11
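A bus monitor or simulation model might represent these encodings as an enumeration. The sketch below uses only the values from TABLE E-2; the class name PReply, the EXPECTED_P_REPLY map, and reply_matches are hypothetical names for illustration.

```python
from enum import IntEnum


class PReply(IntEnum):
    """2-bit P_REPLY encodings from TABLE E-2."""
    P_IDLE = 0b00
    P_WAB = 0b01
    P_WAS = 0b10
    P_RASB = 0b11


# Reply that acknowledges each request type, per TABLE E-2.
EXPECTED_P_REPLY = {
    "P_NCBWR_REQ": PReply.P_WAB,
    "P_NCWR_REQ": PReply.P_WAS,
    "P_NCRD_REQ": PReply.P_RASB,
    "P_NCBRD_REQ": PReply.P_RASB,
}


def reply_matches(request, p_reply_bits):
    """True if the sampled P_REPLY[1:0] value acknowledges the given request."""
    return PReply(p_reply_bits & 0b11) == EXPECTED_P_REPLY[request]
```

Note that single and block reads share one acknowledgment, so a monitor has to remember the outstanding request type to know how much data to expect.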
E.3.2
S_REPLY
S_REPLY is a 3-bit signal between UltraSPARC-IIi and the UPA64S device. TABLE E-4 specifies the S_REPLY encoding. S_REPLY takes a single UPA clock cycle, and initiates data transfer on MEMDATA. The encoding for S_IDLE is 000 (also driven during reset).
TABLE E-3 specifies the S_REPLY types. The following rules apply to S_REPLY generation:
• The S_REPLY is strongly ordered with respect to requests.
• The S_REPLY timing to the source and sink of data is shown in FIGURE E-2 and FIGURE E-3. The UPA64S device drives the data 2 UPA clock cycles after receiving S_SRS | S_SRB. The UPA64S device receives data 1 UPA clock cycle after S_SWS | S_SWB.
• The S_REPLY read data timing after receiving a P_REPLY from the UPA64S device is shown in FIGURE E-4. The minimum number of clock cycles between the P_REPLY and the S_REPLY is two; that is, this number represents the earliest time after receiving P_REPLY that S_REPLY can be sent to get the data.
• S_REPLY can be pipelined such that the MEMDATA bus is kept continually busy, without any dead cycles, as long as the same source is driving the data.
• If sources are switched, one dead cycle is required on the MEMDATA bus; this allows the first source to switch off before the next source can drive the data. The earliest that the next source can drive the data is in the cycle following the dead cycle; thus, the pipelining of data accompanying S_REPLY types is adjusted accordingly with one extra bubble for the dead cycle.
• The ordering of S_REPLY for delivering data to a UPA64S device is shown in FIGURE E-5.
TABLE E-3   S_REPLY Type Definitions

Type      Definition
S_IDLE    Idle. The default state; indicates no reply.
S_SRS     Read Single Ack; the output data queue of the UPA64S device drives 16 bytes of read data in response to the P_RASB reply.
S_SRB     Read Block Ack; the output data queue of the UPA64S device drives 64 bytes of read data in response to the P_RASB reply from it.
S_SWB     Write Block Ack; the input data queue of the UPA64S device accepts 64 bytes of data.
S_SWS     Write Single Ack; the input data queue of the UPA64S device accepts 16 bytes of data.
TABLE E-4   S_REPLY Encoding

S_REPLY   Name                  Reply to Transaction   Encoding
S_IDLE    Idle                  Default State          000
S_SWS     Slave Write Single    P_NCWR_REQ             100
S_SWB     Slave Write Block     P_NCBWR_REQ            101
S_SRS     Slave Read Single     P_NCRD_REQ             110
S_SRB     Slave Read Block      P_NCBRD_REQ            111
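The S_REPLY encodings pair naturally with the transfer sizes given in the type definitions (16 bytes for single, 64 bytes for block). The sketch below assumes those sizes and the 8-bytes-per-clock MEMDATA rate stated in E.1.1; SReply, S_REPLY_BYTES, and data_cycles are illustrative names, and the derived cycle counts are plain arithmetic rather than figures quoted from the manual.

```python
from enum import IntEnum


class SReply(IntEnum):
    """3-bit S_REPLY encodings from TABLE E-4."""
    S_IDLE = 0b000
    S_SWS = 0b100
    S_SWB = 0b101
    S_SRS = 0b110
    S_SRB = 0b111


# Bytes moved on MEMDATA for each reply type (single = 16, block = 64).
S_REPLY_BYTES = {
    SReply.S_IDLE: 0,
    SReply.S_SWS: 16,
    SReply.S_SWB: 64,
    SReply.S_SRS: 16,
    SReply.S_SRB: 64,
}

BYTES_PER_UPA_CLOCK = 8  # MEMDATA[63:0] carries 8 bytes per UPA clock


def data_cycles(reply):
    """Number of UPA clocks of data implied by a reply at 8 bytes per clock."""
    return S_REPLY_BYTES[reply] // BYTES_PER_UPA_CLOCK


def is_read(reply):
    """True for replies that tell the UPA64S device to drive read data."""
    return reply in (SReply.S_SRS, SReply.S_SRB)
```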
E.3.3
P_REPLY and S_REPLY Timing
The following figures show the control of data flow on the MEMDATA bus due to S_REPLY and P_REPLY.
FIGURE E-2   S_REPLY Timing: UPA64S device Sourcing Block (S_SRB on S_REPLY; data D[0] through D[3] appears on the bus 2 clocks later)

FIGURE E-3   S_REPLY Timing: UPA64S device Sinking Block (S_SWB on S_REPLY to the data sink; data D[0] through D[3] appears on the bus 1 clock later)

FIGURE E-4   P_REPLY to S_REPLY Timing (P_RASB from the slave; S_SRB to the data source a minimum of 2 clocks later; data D[0] through D[3] on the bus; S_SWB to the data sink; intervals of 1 clock and 2 clocks are marked)

FIGURE E-5   S_REPLY Pipelining (traces: S_REPLY, Data on Bus, P_REQ; requests NCWR1, NCWR2, and NCBRD3 with replies S_SWS, S_SWS, and S_SRB)
E.4
Issues with Multiple Outstanding Transactions
E.4.1
Strong Ordering
All prior 16-byte noncacheable stores (P_NCWR_REQ) must complete before completing a P_NCRD_REQ. This condition is necessary to meet a software requirement that all noncacheable operations can be strongly ordered. The E-bit feature of UltraSPARC-IIi does not wait for prior noncacheable operations to complete (as do MEMBARs). While a 16-byte noncacheable load is outstanding (P_NCRD_REQ), UltraSPARC-IIi will not issue any more transactions, so the reverse case (completing noncacheable loads before noncacheable stores) does not occur.
E.4.2
Limiting the Number of Transactions
UltraSPARC-IIi can limit the total number of outstanding transactions and, in addition, can limit the amount of outstanding data created by outstanding stores.
E.4.3
S_REPLY assertion
The assertion of S_REPLYs must guarantee that there is at least one dead cycle between different drivers (for example, port and memory). No dead cycle is required for multiple packets from the same driver.
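The dead-cycle rule can be illustrated with a small scheduling model. This is a sketch under simplifying assumptions: each packet is described only by its driver and its length in bus cycles, and the function name schedule and the driver labels are hypothetical.

```python
def schedule(packets):
    """Place packets back to back on the data bus.

    packets: list of (driver, cycles) tuples in issue order.
    Returns a list of (start_cycle, driver, cycles). One dead cycle is
    inserted whenever the driver changes; none between packets from
    the same driver.
    """
    placed = []
    next_free = 0
    previous_driver = None
    for driver, cycles in packets:
        if previous_driver is not None and driver != previous_driver:
            next_free += 1  # dead cycle so the previous driver can switch off
        placed.append((next_free, driver, cycles))
        next_free += cycles
        previous_driver = driver
    return placed


# Two 8-cycle packets from memory followed by a 2-cycle packet from the port.
timeline = schedule([("memory", 8), ("memory", 8), ("port", 2)])
```

In this example the second memory packet starts immediately at cycle 8, while the port packet starts at cycle 17 rather than 16 because of the dead cycle.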
E.5
UPA64S Packet Formats
E.5.1
Request Packets
The SYSADDR bus is a 29-bit transaction request bus. The request packet comprises 58 bits and is carried on the SYSADDR bus in two successive UPA64S clock cycles.
FIGURE E-6   Packet Format: Noncached P_REQ Transactions (two cycles, each carrying bits 28 through 0; fields: Transaction Type, ByteMask, Physical Address)
E.5.2
Packet Description
E.5.2.1
Transaction Type
This 4-bit field encodes the transaction type, as shown in TABLE E-5.
TABLE E-5   Transaction Type Encoding

Transaction Type   Name                   Type
P_NCRD_REQ         NonCachedRead          0101
P_NCBRD_REQ        NonCachedBlockRead     0110
P_NCBWR_REQ        NonCachedBlockWrite    0111
P_NCWR_REQ         NonCachedWrite         1110
E.5.2.2
Physical Address PA
Bits PA of the 39-bit physical address space accessible to UltraSPARC-IIi.
The low order 4 bits PA of the physical address are implied in the bytemask in P_NCRD_REQ and P_NCWR_REQ transactions. All other transactions transfer 64-byte blocks and do not need PA, since it is 0₁₆.
E.5.2.3
Bytemask
Bytemask is only available for P_NCRD_REQ and P_NCWR_REQ. This 16-bit field indicates valid bytes on MEMDATA. The bytemask can select 1, 2, 4, 8, or 16 bytes for noncached read requests; arbitrary bytemasks are allowed for slave writes. An all-zero bytemask indicates a no-op at the slave. Bytemask corresponds to byte 0 (bits in cycle 0 on the 64-bit data bus).
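The constraints on read bytemasks can be expressed as a simple check. The sketch below assumes that each of the 16 mask bits selects one byte of the 16-byte quadword and that the selected bytes of a read must form one naturally aligned, contiguous group; the check gives the same answer whichever end of the mask corresponds to byte 0. The function names are hypothetical.

```python
def is_valid_read_bytemask(mask):
    """True if the 16-bit mask selects 1, 2, 4, 8, or 16 aligned, contiguous bytes."""
    if not 0 < mask <= 0xFFFF:
        return False
    size = bin(mask).count("1")
    if size not in (1, 2, 4, 8, 16):
        return False
    lowest = (mask & -mask).bit_length() - 1      # index of lowest set bit
    contiguous = mask == ((1 << size) - 1) << lowest
    return contiguous and lowest % size == 0      # natural alignment


def is_valid_write_bytemask(mask):
    """Writes accept any 16-bit mask; all zeros is a no-op at the slave."""
    return 0 <= mask <= 0xFFFF
```

For example, 0x00F0 (four bytes at offset 4) passes the read check, while 0x001E (four bytes at offset 1) and 0x0005 (two non-adjacent bytes) do not.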
FIGURE E-7   UPA64S Transactions Flowchart - Address Bus (decisions: read request?, outstanding read?, write request?, outstanding p_reply == max?; actions: wait 1 clk; send addr pckt1 and assert addr_valid; send addr pckt2 and deassert addr_valid)

FIGURE E-8   UPA64S Transactions Flowchart - Data Bus (decisions: read data available?, write data ready?, outstanding writes == max?, databus available next clock?, block read?, block write?; actions: wait 1 clock; send s_reply, dead cycle, read 8B of data; send s_reply, write 8B of data; further 8-byte reads or writes for block transfers)
APPENDIX
F
Pin and Signal Descriptions
F.1
Introduction
This Appendix gives a general description of the UltraSPARC-IIi pins and signals. Consult the relevant data sheets for detailed information about the electrical and mechanical characteristics of the processor, including pin and pad assignments. “Bibliography” on page 485 describes the available data sheets and how to obtain them.
F.2
Pin Interface Signal Descriptions
F.2.1
External Cache (E-cache) Interface
TABLE F-1   Pin Reference - External Cache (E-cache) Interface ¹ ²

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
EDATA[63:0] | 2.6 V | I/O | SRAM_CLK_A/B | E-cache Data Bus; connects UltraSPARC-IIi to the E-cache data RAMs; clocked at 1/2 the processor clock rate
EDPAR[7:0] | 2.6 V | I/O | SRAM_CLK_A/B | E-cache Data Parity; odd parity is driven or checked for all EDATA transfers; MSB corresponds to the MS byte of EDATA; clocked at 1/2 the processor clock rate
TDATA[15:0] | 2.6 V | I/O | SRAM_CLK_A/B | E-cache Tag Data. Bits 15:14 carry the MEI I state; bits[13:0] carry the physical address bits [31:18]; allows a minimum cache size of 256 kilobytes; all TDATA bits are used, even when the E-cache is more than 256 kilobytes; clocked at 1/2 the processor clock rate
TPAR[1:0] | 2.6 V | I/O | SRAM_CLK_A/B | E-cache Tag Parity; odd parity for TDATA[15:0]; TPAR[1] covers TDATA[15:8]; TPAR[0] covers TDATA[7:0]; clocked at 1/2 the processor clock rate
BYTEWE_L[7:0] | 2.6 V | O | SRAM_CLK_A/B | E-cache Byte Write Enables; active low; bit [0] controls EDATA[63:56]; bit [7] controls EDATA[7:0]; clocked at 1/2 the processor clock rate
ECAD[17:0] | 2.6 V | O | SRAM_CLK_A/B | E-cache Data Address; corresponds to physical address [20:3]; allows a maximum 2 MB E-cache; clocked at 1/2 the processor clock rate
ECAT[14:0] | 2.6 V | O | SRAM_CLK_A/B | E-cache Tag Address; corresponds to physical address [20:6]; allows a maximum 2 MB E-cache, with 64-byte lines; clocked at 1/2 the processor clock rate
DSYN_WR_L | 2.6 V | O | SRAM_CLK_A/B | E-cache Data Write Enable; active low; clocked at 1/2 the processor clock rate
DOE_L | 2.6 V | O | SRAM_CLK_A/B | E-cache Data Operation Enable; active low; asserted on all SRAM operations; clocked at 1/2 the processor clock rate
TSYN_WR_L | 2.6 V | O | SRAM_CLK_A/B | E-cache Tag Write Enable; active low; clocked at 1/2 the processor clock rate
TOE_L | 2.6 V | O | SRAM_CLK_A/B | E-cache Tag Operation Enable; active low; clocked at 1/2 the processor clock rate
ECACHE_22_MODE | 3.3 V | I | Not Aligned; Static (all modes) | Selects E-cache 22 mode (1, tie high) or 222 mode (0, tie low), that is, a 2-cycle or a 3-cycle read pipeline

1. Connect unused inputs to the appropriate level.
2. Use approximately 10 kΩ resistors for pullups (unused) and 1 kΩ for pulldowns. Never tie a pin directly to a supply rail.
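The address ranges in TABLE F-1 determine how a physical address is split across the E-cache pins. The sketch below assumes exactly the bit ranges listed there (ECAD = PA[20:3], ECAT = PA[20:6], tag bits TDATA[13:0] = PA[31:18]); the function name ecache_fields and the dictionary keys are illustrative, and the two state bits in TDATA[15:14] are not modeled.

```python
def ecache_fields(pa):
    """Split a physical address into the values driven on the E-cache pins."""
    return {
        "ECAD": (pa >> 3) & 0x3FFFF,   # PA[20:3], 18-bit data RAM address
        "ECAT": (pa >> 6) & 0x7FFF,    # PA[20:6], 15-bit tag RAM address
        "TAG": (pa >> 18) & 0x3FFF,    # PA[31:18], stored in TDATA[13:0]
    }


fields = ecache_fields(0x12345678)
```

Because the tag always holds PA[31:18], the tag and index ranges overlap in PA[20:18] when the cache is larger than 256 kilobytes, which is consistent with the statement that all TDATA bits are used regardless of cache size.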
F.2.2
Internal, SRAM, and UPA Clock Interface
TABLE F-2   Pin Reference - Internal, SRAM, and UPA Clock Interface

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
CLKA | PECL | I | See UltraSPARC-IIi data sheet¹ for logical relation of clocks | Primary positive differential clock source to UltraSPARC-IIi; normally (in 2X mode) runs at 1/2 the internal clock rate; during test, when the PLL is bypassed, the full internal clock rate can be used
CLKB | PECL | I | See UltraSPARC-IIi data sheet¹ for logical relation of clocks | Primary negative differential clock source to UltraSPARC-IIi; normally (in 2X mode) runs at 1/2 the internal clock rate; during test, when the PLL is bypassed, the full internal clock rate can be used
UPA_CLK_POS, UPA_CLK_NEG | PECL | I | See UltraSPARC-IIi data sheet¹ for logical relation of clocks | Signals run at 1/3 frequency of the internal CPU clock; also used to drive the UPA64S; when the UPA64S interface is used these signals indicate to the processor which CLKA edge corresponds to a UPA_CLK_POS edge
SRAM_CLK_POS, SRAM_CLK_NEG | PECL | I | See UltraSPARC-IIi data sheet¹ for logical relation of clocks | Signals run at 1/2 the internal clock rate; also drive the SRAMs; they indicate to the processor which CLKA edges correspond to SRAM_CLK_POS clock edges
PLLBYPASS | 3.3 V | I | Static Signal | Used during test to bypass PLL and PLL2; clock from differential receiver is directly passed to the clock tree; during PLLBYPASS, SRAM_CLK_POS and SRAM_CLK_NEG must be 1/2 the frequency of CLKA and CLKB; also during PLLBYPASS, UPA_CLK_POS and UPA_CLK_NEG must be 1/3 the frequency of CLKA and CLKB; during PLLBYPASS mode, PCI_REF_CLK must be 2X frequency of PCI_CLK
L5CLK | 2.6 V | O | CLKA and CLKB | Internal level 5 clock that reflects the CPU clock; used to determine PLL lock or clock tree delay when in PLL bypass mode; may be disabled during normal operation

1. See “Bibliography”
F.2.3
PCI Clock Interface
TABLE F-3   Pin Reference - PCI Clock Interface

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
PCI_REF_CLK | 3.3 V | I | See UltraSPARC-IIi data sheet¹ for logical relations | PCI reference clock; 40-66 MHz
PCI_CLK | 3.3 V | I | See UltraSPARC-IIi data sheet¹ for logical relations | PCI clock, 66 MHz; can be set to 33 MHz PCI interface if desired
P2L5CLK | 2.6 V | O | PCI_REF_CLK | Disabled during normal operation; internal level 5 clock that reflects the PCI clock and is used to determine PLL lock or clock tree delay when in PLLBYPASS mode; during PLLBYPASS mode, PCI_REF_CLK must be 2X frequency of PCI_CLK
PLLBYPASS | 3.3 V | I | | Refer to TABLE F-2 on page 436

1. See “Bibliography”
F.2.4
JTAG/Debug Interface
TABLE F-4   Pin Reference - JTAG/Debug Interface

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
TDI | 3.3 V | I | Not aligned | IEEE 1149 test data input; pin internally pulled to logic 1 when not driven
TCK | 3.3 V | I | Not aligned | IEEE 1149 test clock input; pin must always be held at logic 1 or logic 0 if not connected to a clock source
TMS | 3.3 V | I | Not aligned | IEEE 1149 test mode select input; pin internally pulled to logic 1 if not driven
TRST_L | 3.3 V | I | Not aligned | IEEE 1149 test reset input (active low); pin internally pulled to logic 1 if not driven
RAM_TEST | 3.3 V | I | Not aligned | When asserted this pin forces the processor into SRAM test mode allowing direct access to the cache SRAMs for memory testing
ITB_TEST_MODE | 3.3 V | I | Not aligned | Enables a special SRAM mode for testing the ITB megacell; pull to ground using a 10.7 kΩ, 1% resistor
EXT_EVENT | 3.3 V | I | Not aligned | Signal used to indicate that the clock should be stopped; debug signal set inactive to logic 0 on production systems
TDO | 2.6 V | O | Not aligned | IEEE 1149 test data output; tri-state signal driven only when the TAP controller is in the shift-DR state
PMO | 2.6 V | O | Not aligned | Used for on-chip process monitors; reserved for IC manufacturing only
TEMP_SEN[1:0] | N/A | O | | Defines scale end points of the processor temperature sense element on the module; reserved for IC manufacturing only
F.2.5
Initialization Interface
TABLE F-5   Pin Reference - Initialization Interface

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
P_RESET_L | 3.3 V | I | Not Aligned | For non power-on resets (debug); asynchronous assertion and de-assertion; active low
X_RESET_L | 3.3 V | I | Not Aligned | Driven to signal XIR traps (debug); acts as non-maskable interrupt; asynchronous assertion and de-assertion; active low
SYS_RESET_L | 3.3 V | I | Not Aligned | Driven for power-on resets (POR); asynchronous assertion and de-assertion; active low¹
RST_L | 3.3 V | O | Not Aligned | Resets PCI subsystem; asynchronous assertion and monotonic deassertion; also used for UPA64S reset
RMTV_SEL | 3.3 V | I | Not Aligned | Red Mode Trap Vector Select; pull up if alternate PC-compatible boot vector is required
CLKSEL | 3.3 V | I | Not Aligned | Pullup to enable the 2x function of the CLKA/B PLL; E-cache interface still works at 1/2 the internal processor clock rate
EPD | 2.6 V | O | Not Aligned | Asserted when UltraSPARC-IIi is in clock shutdown mode; use P_RESET_L to re-start

1. SYS_RESET_L must be a clean indication that 3.3 V, 5 V, etc. are stable and within specification. No anomalies may be present, beginning when the power supplies are turned on and extending until the signals are within specification. When signals are within specification, the power supply can transition monotonically to 3.3 V.
F.2.6
PCI interface
TABLE F-6   Pin Reference - PCI interface

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
AD[31:0] | 3.3 V | I/O | PCI_CLK | Address/Data; multiplexed on same PCI pins
CBE_L[3:0] | 3.3 V | I/O | PCI_CLK | Bus Command and Byte Enables; multiplexed on same PCI pins
PAR | 3.3 V | I/O | PCI_CLK | Parity; even parity across AD[31:0] and CBE_L[3:0]
DEVSEL_L | 3.3 V | STS¹ | PCI_CLK | Device Select. Indicates the driving device has decoded the address of the target of the current access; as input, indicates whether any device has been selected
FRAME_L | 3.3 V | STS | PCI_CLK | Cycle Frame; driven by current master to indicate beginning and end of an access
REQ_L[3:0] | 3.3 V | I | PCI_CLK | Request; indicates to arbiter that an external device requires use of the bus
GNT_L[3:0] | 3.3 V | T/S² | PCI_CLK | Grant; indicates to device that bus access has been granted
IRDY_L | 3.3 V | STS | PCI_CLK | Initiator Ready; indicates the bus master’s ability to complete the current data phase
TRDY_L | 3.3 V | STS | PCI_CLK | Target Ready; indicates the selected device’s ability to complete the current data phase
PERR_L | 3.3 V | STS | PCI_CLK | Parity error; reports data parity errors
SERR_L | 3.3 V | O/D | PCI_CLK | System Error; reports address parity errors, data parity errors on special cycles, or any other catastrophic PCI errors
STOP_L | 3.3 V | STS | PCI_CLK | Stop; indicates that the current target is requesting the master to stop the current transaction

1. Sustained Tri-State. STS is an active low tri-state signal owned and driven by one and only one agent at a time. The agent that drives an STS pin low must drive it high for at least one clock before letting it float. A new agent cannot start driving an STS signal any sooner than one clock after the previous owner tri-states it. A pullup is required to sustain the inactive state until another agent drives it, and must be provided by the motherboard or module.
2. Tri-State Output.
F.2.7
Interrupt Interface
TABLE F-7   Pin Reference - Interrupt Interface

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
SB_DRAIN | 3.3 V | O | PCI_CLK | Store Buffer Drain; sampled at a 66 MHz PCI_CLK edge; asserted after interrupts, or by software, to cause outstanding DMA writes to be flushed from buffers
SB_EMPTY[1:0] | 3.3 V | I | PCI_CLK | Store Buffer Empty; sampled at a 66 MHz PCI_CLK edge; asserted when the external APB PCI bus bridge indicates that all DMA writes queued before the assertion of SB_DRAIN have left the bus bridge
INT_NUM[5:0] | 3.3 V | I | PCI_CLK | Interrupt Number; sampled at a 66 MHz PCI_CLK edge; encoded interrupt request
F.2.8
Memory and Transceiver Interface
TABLE F-8   Pin Reference - Memory and Transceiver Interface

Symbol | V | Type | Signal Transitions Aligned w/ | Name and Function
MEM_WE_L | 3.3 V | O | CLKA/B | Memory Write Enable; active low
MEM_CAS_L[1:0] | 3.3 V | O | CLKA/B | Memory Column Address Strobe; active low
MEM_RAST_L[3:0] | 3.3 V | O | CLKA/B | Memory Row Address Strobe Top; active low
MEM_RASB_L[3:0] | 3.3 V | O | CLKA/B | Memory Row Address Strobe Bottom; active low
MEM_DATA[71:0] | 3.3 V | I/O | CLKA/B | Memory Data; bits [71:64] are ECC bits
MEM_ADDR[12:0] | 3.3 V | O | CLKA/B | Memory Address, row and column (10 and 11 bit column support)
XCVR_OEA_L | 3.3 V | O | CLKA/B | Transceiver Output Enable A; active low
XCVR_OEB_L | 3.3 V | O | CLKA/B | Transceiver Output Enable B; active low
XCVR_SEL_L | 3.3 V | O | CLKA/B | Transceiver Select; active low; picks high or low half of read data
XCVR_WR_CNTL[1:0] | 3.3 V | O | CLKA/B | Transceiver Write Control; controls clock enables on internal registers
XCVR_RD_CNTL[1:0] | 3.3 V | O | CLKA/B | Transceiver Read Control; controls clock enables on internal registers
XCVR_CLK[2:0] | 3.3 V | O | CLKA/B | Transceiver Clock; all data and control signals are registered by these clocks; multiple outputs to minimize loading effects of 6 transceivers
F.2.9
UPA64S Interface
TABLE F-9
Pin Reference - UPA64S Interface
V: 3.3 V. Signal transitions aligned with: UPA_CLK_POS/NEG.
Symbol | Type | Name and Function
S_REPLY[2:0] | O | S_Reply; encoded command to UPA64S device; indicates arrival of write data on MEM_DATA[63:0], or command to drive MEM_DATA[63:0] with read data
P_REPLY[1:0] | I | P_Reply; encoded command from UPA64S device that indicates consumption of prior write data, or ability to provide read data
SYSADR[28:0] | I/O (1) | System Address; sends 2-cycle address packet to UPA64S slave, or provides internal state debug information
ADR_VLD | O | Address Valid; asserted during first cycle of two-cycle address packet
1. Not all of SYSADR[28:0] is bidirectional: SYSADR[14:0] is I/O, but SYSADR[28:15] is output only. SYSADR[14:0] is used as an input during RAM_TEST.
APPENDIX
G
ASI Names
G.1
Introduction
This Appendix lists the names and suggested macro syntax for all supported Address Space Identifiers.
TABLE G-1 ASI Names, listed alphabetically
ASI Name or Macro Syntax | Description | Value
ASI_AFAR | Asynchronous fault address register | 0x4D
ASI_AFSR | Asynchronous fault status register | 0x4C
ASI_AIUP | Primary address space, user privilege | 0x10
ASI_AIUPL | Primary address space, user privilege, little endian | 0x18
ASI_AIUS | Secondary address space, user privilege | 0x11
ASI_AIUSL | Secondary address space, user privilege, little endian | 0x19
ASI_AS_IF_USER_PRIMARY | Primary address space, user privilege | 0x10
ASI_AS_IF_USER_PRIMARY_LITTLE | Primary address space, user privilege, little endian | 0x18
ASI_AS_IF_USER_SECONDARY | Secondary address space, user privilege | 0x11
ASI_AS_IF_USER_SECONDARY_LITTLE | Secondary address space, user privilege, little endian | 0x19
ASI_BLK_AIUP | Primary address space, block load/store, user privilege | 0x70
ASI_BLK_AIUPL | Primary address space, block load/store, user privilege, little endian | 0x78
TABLE G-1 ASI Names, listed alphabetically (Continued)
ASI Name or Macro Syntax | Description | Value
ASI_BLK_AIUS | Secondary address space, block load/store, user privilege | 0x71
ASI_BLK_AIUSL | Secondary address space, block load/store, user privilege, little endian | 0x79
ASI_BLK_COMMIT_P | Primary address space, block store commit operation | 0xE0
ASI_BLK_COMMIT_PRIMARY | Primary address space, block store commit operation | 0xE0
ASI_BLK_COMMIT_S | Secondary address space, block store commit operation | 0xE1
ASI_BLK_COMMIT_SECONDARY | Secondary address space, block store commit operation | 0xE1
ASI_BLK_P | Primary address space, block load/store | 0xF0
ASI_BLK_PL | Primary address space, block load/store, little endian | 0xF8
ASI_BLK_S | Secondary address space, block load/store | 0xF1
ASI_BLK_SL | Secondary address space, block load/store, little endian | 0xF9
ASI_BLOCK_AS_IF_USER_PRIMARY | Primary address space, block load/store, user privilege | 0x70
ASI_BLOCK_AS_IF_USER_PRIMARY_LITTLE | Primary address space, block load/store, user privilege, little endian | 0x78
ASI_BLOCK_AS_IF_USER_SECONDARY | Secondary address space, block load/store, user privilege | 0x71
ASI_BLOCK_AS_IF_USER_SECONDARY_LITTLE | Secondary address space, block load/store, user privilege, little endian | 0x79
ASI_BLOCK_PRIMARY | Primary address space, block load/store | 0xF0
ASI_BLOCK_PRIMARY_LITTLE | Primary address space, block load/store, little endian | 0xF8
ASI_BLOCK_SECONDARY | Secondary address space, block load/store | 0xF1
ASI_BLOCK_SECONDARY_LITTLE | Secondary address space, block load/store, little endian | 0xF9
ASI_D-MMU | D-MMU Tag Target Register | 0x58
ASI_DCACHE_DATA | D-cache data RAM diagnostics access | 0x46
ASI_DCACHE_TAG | D-cache tag/valid RAM diagnostics access | 0x47
ASI_DMMU | D-MMU PA Data Watchpoint Register | 0x58
ASI_DMMU | D-MMU Secondary Context Register | 0x58
TABLE G-1 ASI Names, listed alphabetically (Continued)
ASI Name or Macro Syntax | Description | Value
ASI_DMMU | D-MMU Synch. Fault Address Register | 0x58
ASI_DMMU | D-MMU Synch. Fault Status Register | 0x58
ASI_DMMU | D-MMU Tag Target Register | 0x58
ASI_DMMU | D-MMU TLB Tag Access Register | 0x58
ASI_DMMU | D-MMU TSB Register | 0x58
ASI_DMMU | D-MMU VA Data Watchpoint Register | 0x58
ASI_DMMU | I/D MMU Primary Context Register | 0x58
ASI_DMMU_DEMAP | DMMU TLB demap | 0x5F
ASI_DMMU_TSB_64KB_PTR_REG | D-MMU TSB 64K Pointer Register | 0x5A
ASI_DMMU_TSB_8KB_PTR_REG | D-MMU TSB 8K Pointer Register | 0x59
ASI_DMMU_TSB_DIRECT_PTR_REG | D-MMU TSB Direct Pointer Register | 0x5B
ASI_DTLB_DATA_ACCESS_REG | D-MMU TLB Data Access Register | 0x5D
ASI_DTLB_DATA_IN_REG | D-MMU TLB Data In Register | 0x5C
ASI_DTLB_TAG_READ_REG | D-MMU TLB Tag Read Register | 0x5E
ASI_ECACHE_R | E-cache data RAM diagnostic read access | 0x7E
ASI_ECACHE_R | E-cache tag/valid RAM diagnostic read access | 0x7E
ASI_ECACHE_TAG_DATA | E-cache tag/valid RAM data diagnostic access | 0x4E
ASI_ECACHE_W | E-cache data RAM diagnostic write access | 0x76
ASI_ECACHE_W | E-cache tag/valid RAM diagnostic write access | 0x76
ASI_EC_R | E-cache data RAM diagnostic read access | 0x7E
ASI_EC_R | E-cache tag/valid RAM diagnostic read access | 0x7E
ASI_EC_TAG_DATA | E-cache tag/valid RAM data diagnostic access | 0x4E
ASI_EC_W | E-cache data RAM diagnostic write access | 0x76
ASI_EC_W | E-cache tag/valid RAM diagnostic write access | 0x76
ASI_ESTATE_ERROR_EN_REG | E-cache error enable register | 0x4B
ASI_FL16_P | Primary address space, one 16-bit floating-point load/store | 0xD2
ASI_FL16_PL | Primary address space, one 16-bit floating-point load/store, little endian | 0xDA
ASI_FL16_PRIMARY | Primary address space, one 16-bit floating-point load/store | 0xD2
TABLE G-1 ASI Names, listed alphabetically (Continued)
ASI Name or Macro Syntax | Description | Value
ASI_FL16_PRIMARY_LITTLE | Primary address space, one 16-bit floating-point load/store, little endian | 0xDA
ASI_FL16_S | Secondary address space, one 16-bit floating-point load/store | 0xD3
ASI_FL16_SECONDARY | Secondary address space, one 16-bit floating-point load/store | 0xD3
ASI_FL16_SECONDARY_LITTLE | Secondary address space, one 16-bit floating-point load/store, little endian | 0xDB
ASI_FL16_SL | Secondary address space, one 16-bit floating-point load/store, little endian | 0xDB
ASI_FL8_P | Primary address space, one 8-bit floating-point load/store | 0xD0
ASI_FL8_PL | Primary address space, one 8-bit floating-point load/store, little endian | 0xD8
ASI_FL8_PRIMARY | Primary address space, one 8-bit floating-point load/store | 0xD0
ASI_FL8_PRIMARY_LITTLE | Primary address space, one 8-bit floating-point load/store, little endian | 0xD8
ASI_FL8_S | Secondary address space, one 8-bit floating-point load/store | 0xD1
ASI_FL8_SECONDARY | Secondary address space, one 8-bit floating-point load/store | 0xD1
ASI_FL8_SECONDARY_LITTLE | Secondary address space, one 8-bit floating-point load/store, little endian | 0xD9
ASI_FL8_SL | Secondary address space, one 8-bit floating-point load/store, little endian | 0xD9
ASI_ICACHE_INSTR | I-cache instruction RAM diagnostic access | 0x66
ASI_ICACHE_NEXT_FIELD | I-cache next-field RAM diagnostics access | 0x6F
ASI_ICACHE_PRE_DECODE | I-cache pre-decode RAM diagnostics access | 0x6E
ASI_ICACHE_TAG | I-cache tag/valid RAM diagnostic access | 0x67
ASI_IC_INSTR | I-cache instruction RAM diagnostic access | 0x66
ASI_IC_NEXT_FIELD | I-cache next-field RAM diagnostics access | 0x6F
ASI_IC_PRE_DECODE | I-cache pre-decode RAM diagnostics access | 0x6E
ASI_IC_TAG | I-cache tag/valid RAM diagnostic access | 0x67
ASI_IMMU | I-MMU Synchronous Fault Status Register | 0x50
ASI_IMMU | I-MMU Tag Target Register | 0x50
TABLE G-1 ASI Names, listed alphabetically (Continued)
ASI Name or Macro Syntax | Description | Value
ASI_IMMU | I-MMU TLB Tag Access Register | 0x50
ASI_IMMU | I-MMU TSB Register | 0x50
ASI_IMMU_DEMAP | I-MMU TLB demap | 0x57
ASI_IMMU_TSB_64KB_PTR_REG | I-MMU TSB 64KB Pointer Register | 0x52
ASI_IMMU_TSB_8KB_PTR_REG | I-MMU TSB 8KB Pointer Register | 0x51
ASI_INTR_DISPATCH_STATUS | Interrupt vector dispatch status | 0x48
ASI_INTR_RECEIVE | Interrupt vector receive status | 0x49
ASI_ITLB_DATA_ACCESS_REG | I-MMU TLB Data Access Register | 0x55
ASI_ITLB_DATA_IN_REG | I-MMU TLB Data In Register | 0x54
ASI_ITLB_TAG_READ_REG | I-MMU TLB Tag Read Register | 0x56
ASI_LSU_CONTROL_REG | Load/store unit control register | 0x45
ASI_N | Implicit address space, nucleus privilege, TL > 0 | 0x04
ASI_NL | Implicit address space, nucleus privilege, TL > 0, little endian | 0x0C
ASI_NUCLEUS | Implicit address space, nucleus privilege, TL > 0 | 0x04
ASI_NUCLEUS_LITTLE | Implicit address space, nucleus privilege, TL > 0, little endian | 0x0C
ASI_NUCLEUS_QUAD_LDD | Cacheable, 128-bit atomic LDDA | 0x24
ASI_NUCLEUS_QUAD_LDD_L | Cacheable, 128-bit atomic LDDA, little endian | 0x2C
ASI_NUCLEUS_QUAD_LDD_LITTLE | Cacheable, 128-bit atomic LDDA, little endian | 0x2C
ASI_P | Implicit primary address space | 0x80
ASI_PHYS_BYPASS_EC_WITH_EBIT | Physical address, noncacheable, with side-effect | 0x15
ASI_PHYS_BYPASS_EC_WITH_EBIT_L | Physical address, noncacheable, with side-effect, little endian | 0x1D
ASI_PHYS_BYPASS_EC_WITH_EBIT_LITTLE | Physical address, noncacheable, with side-effect, little endian | 0x1D
ASI_PHYS_USE_EC | Physical address, external cacheable only | 0x14
ASI_PHYS_USE_EC_L | Physical address, external cacheable only, little endian | 0x1C
ASI_PHYS_USE_EC_LITTLE | Physical address, external cacheable only, little endian | 0x1C
ASI_PL | Implicit primary address space, little endian | 0x88
ASI_PNF | Primary address space, no fault | 0x82
TABLE G-1 ASI Names, listed alphabetically (Continued)
ASI Name or Macro Syntax | Description | Value
ASI_PNFL | Primary address space, no fault, little endian | 0x8A
ASI_PRIMARY | Implicit primary address space | 0x80
ASI_PRIMARY_LITTLE | Implicit primary address space, little endian | 0x88
ASI_PRIMARY_NO_FAULT | Primary address space, no fault | 0x82
ASI_PRIMARY_NO_FAULT_LITTLE | Primary address space, no fault, little endian | 0x8A
ASI_PST16_PL | Primary address space, 4 16-bit partial store, little endian | 0xCA
ASI_PST16_PRIMARY | Primary address space, 4 16-bit partial store | 0xC2
ASI_PST16_PRIMARY_LITTLE | Primary address space, 4 16-bit partial store, little endian | 0xCA
ASI_PST16_S | Secondary address space, 4 16-bit partial store | 0xC3
ASI_PST16_SECONDARY | Secondary address space, 4 16-bit partial store | 0xC3
ASI_PST16_SECONDARY_LITTLE | Secondary address space, 4 16-bit partial store, little endian | 0xCB
ASI_PST16_SL | Secondary address space, 4 16-bit partial store, little endian | 0xCB
ASI_PST32_P | Primary address space, 2 32-bit partial store | 0xC4
ASI_PST32_PL | Primary address space, 2 32-bit partial store, little endian | 0xCC
ASI_PST32_PRIMARY | Primary address space, 2 32-bit partial store | 0xC4
ASI_PST32_PRIMARY_LITTLE | Primary address space, 2 32-bit partial store, little endian | 0xCC
ASI_PST32_S | Secondary address space, 2 32-bit partial store | 0xC5
ASI_PST32_SECONDARY | Secondary address space, 2 32-bit partial store | 0xC5
ASI_PST32_SECONDARY_LITTLE | Secondary address space, 2 32-bit partial store, little endian | 0xCD
ASI_PST32_SL | Secondary address space, 2 32-bit partial store, little endian | 0xCD
ASI_PST8_P | Primary address space, 8 8-bit partial store | 0xC0
ASI_PST8_PL | Primary address space, 8 8-bit partial store, little endian | 0xC8
ASI_PST8_PRIMARY | Primary address space, 8 8-bit partial store | 0xC0
ASI_PST8_PRIMARY_LITTLE | Primary address space, 8 8-bit partial store, little endian | 0xC8
ASI_PST8_S | Secondary address space, 8 8-bit partial store | 0xC1
TABLE G-1 ASI Names, listed alphabetically (Continued)
ASI Name or Macro Syntax | Description | Value
ASI_PST8_SECONDARY | Secondary address space, 8 8-bit partial store | 0xC1
ASI_PST8_SECONDARY_LITTLE | Secondary address space, 8 8-bit partial store, little endian | 0xC9
ASI_PST8_SL | Secondary address space, 8 8-bit partial store, little endian | 0xC9
ASI_PST16_P | Primary address space, 4 16-bit partial store | 0xC2
ASI_S | Implicit secondary address space | 0x81
ASI_SECONDARY | Implicit secondary address space | 0x81
ASI_SECONDARY_LITTLE | Implicit secondary address space, little endian | 0x89
ASI_SECONDARY_NO_FAULT | Secondary address space, no fault | 0x83
ASI_SECONDARY_NO_FAULT_LITTLE | Secondary address space, no fault, little endian | 0x8B
ASI_SL | Implicit secondary address space, little endian | 0x89
ASI_SNF | Secondary address space, no fault | 0x83
ASI_SNFL | Secondary address space, no fault, little endian | 0x8B
ASI_UDBL_CONTROL_R | External UDB Control Register, read low | 0x7F
ASI_UDBH_CONTROL_R | External UDB Control Register, read high | 0x7F
ASI_UDBH_CONTROL_REG_READ | External UDB Control Register, read high | 0x7F
ASI_UDBH_CONTROL_REG_WRITE | External UDB Control Register, write high | 0x77
ASI_UDBH_ERROR_R | External UDB Error Register, read high | 0x7F
ASI_UDBH_ERROR_REG_READ | External UDB Error Register, read high | 0x7F
ASI_UDBH_ERROR_REG_WRITE | External UDB Error Register, write high | 0x77
ASI_UDBL_CONTROL_REG_READ | External UDB Control Register, read low | 0x7F
ASI_UDBL_CONTROL_REG_WRITE | External UDB Control Register, write low | 0x77
ASI_UDBL_ERROR_R | External UDB Error Register, read low | 0x7F
ASI_UDBL_ERROR_REG_READ | External UDB Error Register, read low | 0x7F
ASI_UDBL_ERROR_REG_WRITE | External UDB Error Register, write low | 0x77
ASI_UDB_CONTROL_W | External UDB Control Register, write high | 0x77
ASI_UDB_CONTROL_W | External UDB Control Register, write low | 0x77
ASI_UDB_ERROR_W | External UDB Error Register, write high | 0x77
ASI_UDB_ERROR_W | External UDB Error Register, write low | 0x77
ASI_UDB_INTR_R | Incoming interrupt vector data register 0 | 0x7F
ASI_UDB_INTR_R | Incoming interrupt vector data register 1 | 0x7F
TABLE G-1 ASI Names, listed alphabetically (Continued)
ASI Name or Macro Syntax | Description | Value
ASI_UDB_INTR_R | Incoming interrupt vector data register 2 | 0x7F
ASI_UDB_INTR_W | Interrupt vector dispatch | 0x77
ASI_UDB_INTR_W | Outgoing interrupt vector data register 0 | 0x77
ASI_UDB_INTR_W | Outgoing interrupt vector data register 1 | 0x77
ASI_UDB_INTR_W | Outgoing interrupt vector data register 2 | 0x77
ASI_UPA_CONFIG_REG | UPA configuration register | 0x4A
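In the values listed in TABLE G-1, each little-endian ASI differs from its big-endian counterpart only in bit 3 (for example 0x10 and 0x18, 0x80 and 0x88, 0xF0 and 0xF8). The C sketch below encodes a few table entries as macros and expresses that relationship. The helper name asi_little_variant is hypothetical, and the bit-3 pattern is an observation drawn from the table values, not a rule stated by this appendix.

```c
#include <assert.h>
#include <stdint.h>

/* A few entries copied from TABLE G-1. */
#define ASI_N       0x04u  /* implicit address space, nucleus privilege */
#define ASI_NL      0x0Cu  /* same, little endian */
#define ASI_AIUP    0x10u  /* primary address space, user privilege */
#define ASI_AIUPL   0x18u  /* same, little endian */
#define ASI_P       0x80u  /* implicit primary address space */
#define ASI_PL      0x88u  /* same, little endian */
#define ASI_BLK_P   0xF0u  /* primary address space, block load/store */
#define ASI_BLK_PL  0xF8u  /* same, little endian */

/*
 * Hypothetical helper: derive the little-endian ASI from a big-endian one
 * by setting bit 3. Only meaningful for ASIs that TABLE G-1 lists with a
 * little-endian counterpart; for other values the result is unrelated
 * (0x45 | 0x08 is 0x4D, which the table lists as ASI_AFAR).
 */
static inline uint8_t asi_little_variant(uint8_t asi)
{
    assert((asi & 0x08u) == 0);   /* expects the big-endian form */
    return (uint8_t)(asi | 0x08u);
}
```

A caller would write, for instance, asi_little_variant(ASI_P) to obtain the value listed for ASI_PL.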
APPENDIX
H
Event Ordering on UltraSPARC-IIi
H.1
Highlight of US-IIi specific issues
UltraSPARC-IIi meets the requirements of the SPARC V9 and SUN4U memory models. Some important points that may not be obvious:
• The membar instruction cannot be used to guarantee that a noncacheable store has completed to a device.
On UltraSPARC-IIi, however, explicit membar instructions can be used to guarantee that PCI activity has progressed to the primary PCI buses. Progress to the UPA64S interface cannot be guaranteed with membars.
• A single cacheable mutex semaphore should not be used to control shared access to a PCI device when the shared access involves the processor and a PCI DMA master. A robust solution might instead use a passed token in a single-reader, single-writer lock exchange. This solution meets the PCI producer/consumer model.
There is a lack of SMP-like ordering because a PCI DMA master can short-circuit the global ordering mechanism by direct peer-to-peer access to the device on its local bus. This could allow the PCI DMA master to issue stores to the device that jump ahead of uncompleted activity from the processor. The issue exists because of the hierarchy of buses in the PCI domain, and because the membar instruction cannot guarantee the completion of a noncacheable store.
• A single cacheable mutex semaphore is ideal for controlling similarly shared access to cacheable memory or the UPA64S interface, since the PCI DMA master cannot jump ahead of any globally ordered CPU activity, and SMP-like global ordering is enforced with the ordering point inside UltraSPARC-IIi.
• The SUN4U architecture has no mechanism for ordering PCI PIO and DMA activity. DMA event completion is ordered with interrupts, or possibly with a cacheable semaphore as noted above.
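The passed-token idea mentioned in the list above can be sketched in C as follows. This is only one possible reading of "single-reader, single-writer lock exchange": two counters, each written by exactly one party and read by the other. The structure, the function names, and the choice of counters are all assumptions for illustration; the manual does not specify a layout. The sketch runs on ordinary memory, so the DMA master's side is simply called from the same program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token exchange: each word has exactly one writer. */
typedef struct {
    volatile uint32_t cpu_passes;  /* written only by the processor  */
    volatile uint32_t dma_passes;  /* written only by the DMA master */
} token_exchange;

/* The processor holds the token while both counts agree. */
static inline bool cpu_has_token(const token_exchange *t)
{
    assert(t != NULL);
    return t->cpu_passes == t->dma_passes;
}

static inline bool dma_has_token(const token_exchange *t)
{
    return !cpu_has_token(t);
}

/* Processor side: hand the device over to the DMA master. */
static inline bool cpu_pass_token(token_exchange *t)
{
    if (!cpu_has_token(t))
        return false;              /* not ours to give away */
    /* On real hardware the processor would first confirm that its own
     * stores reached the device (for example with a read-back load),
     * because a membar alone does not guarantee completion. */
    t->cpu_passes = t->cpu_passes + 1u;
    return true;
}

/* DMA master side: return the device to the processor. */
static inline bool dma_return_token(token_exchange *t)
{
    if (!dma_has_token(t))
        return false;
    t->dma_passes = t->cpu_passes; /* counts agree again */
    return true;
}
```

Because neither word is ever written by both parties, the exchange does not depend on an atomic read-modify-write being ordered between the processor and the PCI domain.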
H.2
Review of SPARC V9 load/store ordering
The SPARC V9 Architecture began with a straightforward set of “sequencing” memory barrier instructions (membars) to be used by software to guarantee that prior program order loads and/or stores would be globally ordered before future program order loads and/or stores, for a single processor. This global order could be considered “created” when the system could guarantee that the loads and stores would eventually complete at their final destination with effects consistent with this global order. This known global ordering of events is necessary in multi-processor systems when processors share access to common resources. The formal definition of order is more abstract than this description, but this language follows the behavior of typical hardware implementations.
Complicating the issue, implementations typically introduce, for performance reasons, additional queues for noncacheable operations that can operate in parallel with the ordering mechanisms for cacheable operations. Requiring the membars to order both cacheable and noncacheable events was believed to create a performance problem, since some membars exist implicitly for certain memory models. Consequently, V9 specified that the sequencing membars apply separately within the cacheable and noncacheable domains. To order between domains without the additional overhead of Membar #Sync, a Membar #MemIssue instruction was created. Membar #Sync is additionally constrained to guarantee that the effects of any exceptions have been ordered.
According to V9:
“All memory reference operations appearing prior to the MEMBAR #MemIssue must have been performed before any memory operation after the MEMBAR #MemIssue may be initiated.”
The word “performed” may have been purposely chosen to be nebulous! This instruction is known as a “completion” membar, and the apparent implication was that subsequent load/stores would be stalled until prior loads were completed, and prior stores were completed to the destination (device).
However, the SUN4U architecture recognized store “completion” as a possible performance problem and relaxed the definition to mean that load/store issue would be stalled until all prior loads and stores had been globally ordered. This global order would be preserved out to the device, which was responsible for completing them in that order. No side-effects between devices were allowed, so this model meets the overall goals.
If knowledge of store completion to the device were really necessary for some reason, perhaps because of side-effects, SUN4U requires software to issue a load to that device (at some implementation-specific address) and wait for its completion. The device is responsible for completing the effects of all prior load/stores before completing that load.
In short, the SUN4U requirement for a Membar #MemIssue is the same as that for a sequencing Membar with #StoreStore, #StoreLoad, #LoadLoad, and #LoadStore all set, but with the effects applied across both cacheable and noncacheable domains.
UltraSPARC I and II actually implement a more conservative approach to the explicitly coded sequencing Membars. The sequencing effect applies equally against cacheable and noncacheable loads and stores. (This is not true for the implicit sequencing membars in the memory models.)
With PSTATE.MM==TSO, UltraSPARC I and II will guarantee that all stores, both cacheable and noncacheable, are ordered globally so as to complete in program order. This is described as an implicit Membar #MemIssue in the User’s Manual.
With PSTATE.MM==PSO or RMO, store ordering is not necessarily preserved, notably between cacheable and noncacheable stores, and between cacheable block store commits and other cacheable stores. Note that global ordering may also be important in all memory models if noncacheable loads have side-effects.
For the noncacheable domain, the DMMU supports a per-page mapping bit called the E-bit, which has the same architecturally specified effect as a membar with all the sequencing bits set between loads and stores. That is, a strong sequential order is created and preserved out to devices. However, the E-bit only orders load/stores within the noncacheable domain.
H.2.1
Ordering load/store Activity Out To The Primary PCI bus
This activity is not a requirement of the software model, but it is a design feature that might be minimally useful in debug situations.
UltraSPARC I and II membars only guarantee that PIO stores have completed as far as the processor data bus system, not to the SBUS or any PCI bus. As noted, the global order created is preserved from that point on. Since the software model has no ordering between DMA and PIO on the PCI bus, there should not be any case of software using a Membar #Sync to guarantee some ordering of events on the PCI bus.
The SUN4U software model description states:
“There are times that it is desirable to know if an I/O access has completed....”
“Any store queue must have an address associated with it that can be read by a processor to see if previously issued stores have completed, this may be the address of a safe-to-read status or control register...”
“Code that wishes to see if the path from the processor to a device has been cleared can do so by reading the synchronization address associated with the buffer closest to the target device.”
UltraSPARC-IIi also does not guarantee, with a Membar #Sync, that writes to UPA64S have completed all the way to the UPA64S interface. Since UPA64S is a single-master interface, no multi-master ordering issues exist. The software model instead uses loads to determine store completion all the way to the UPA64S internals.
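The load-after-store technique described in this appendix can be sketched in C as follows. The register block, its field names, and the function name are hypothetical; on real hardware the structure would be a noncacheable mapping of the device and the status field would be the device's safe-to-read synchronization address. Here it is ordinary memory so that the sketch can be compiled and run anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device register block (ordinary memory in this sketch). */
typedef struct {
    volatile uint32_t control;  /* hypothetical register being written     */
    volatile uint32_t status;   /* hypothetical safe-to-read sync address  */
} device_regs;

/*
 * Store to the device, then load from the same device and wait for the
 * load data. Per the text, the device is responsible for completing the
 * effects of all prior load/stores before completing that load, so the
 * returned value doubles as a completion indication. A membar between
 * the two accesses would order them but would not by itself prove that
 * the store reached the device.
 */
static inline uint32_t pio_store_and_confirm(device_regs *dev, uint32_t value)
{
    assert(dev != NULL);
    dev->control = value;   /* PIO store: may still be queued on the way out */
    return dev->status;     /* read-back load from the same device           */
}
```

The volatile qualifiers keep the compiler from removing or reordering the two accesses; ordering and completion beyond the processor are properties of the hardware path, not of this code.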
APPENDIX
I
Observability Bus
UltraSPARC-IIi implements an observability bus to assist in bringing up the processor and its associated systems. The bus can also be used for performance monitoring and instrumentation.
I.1
Theory of Operation
I.1.1
Muxing
At any one time, one group of 15 signals out of five possible groups (75 total signals) is selected for output to the SYSADR[14:0] pins of UltraSPARC-IIi. This selection is controlled by an ASR register.
Since SYSADR is used for UPA64S addresses, the observability information is not available for the two UPA clocks (six processor clocks) of a UPA64S address packet, and for one more UPA clock after that (three processor clocks). This period is indicated by the assertion of ADR_VLD for the first three processor clocks of the period. After the nine processor clocks have expired, SYSADR[14:0] can again change state every processor clock instead of being aligned to UPA clocks.
To avoid sending 300 MHz signals to UPA64S during normal operation, program the select to choose all 1’s. This selection also limits EMI by disabling the test L5CLK outputs (CPU and PCI) on UltraSPARC-IIi.
The first group (group 0) is chosen to be the most useful debug group, since this is the default group selected upon POR. There is no overlap of signals between groups.
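The clock arithmetic in this section (two UPA clocks for the address packet plus one more, at three processor clocks per UPA clock) can be written out as a small C sketch. The names are hypothetical, and the second helper additionally assumes that address packets in a capture window do not overlap or run back to back, which the text does not state.

```c
#include <assert.h>

enum {
    CPU_CLOCKS_PER_UPA_CLOCK = 3,  /* two UPA clocks are six processor clocks */
    UPA_PACKET_CLOCKS        = 2,  /* UPA64S address packet                   */
    UPA_TRAILING_CLOCKS      = 1   /* one more UPA clock after the packet     */
};

/* Processor clocks per address packet with no observability data. */
static inline unsigned obs_blackout_cpu_clocks(void)
{
    return (UPA_PACKET_CLOCKS + UPA_TRAILING_CLOCKS) * CPU_CLOCKS_PER_UPA_CLOCK;
}

/*
 * Hypothetical helper: processor clocks in a capture window during which
 * SYSADR[14:0] carries observability data, assuming non-overlapping packets.
 */
static inline unsigned obs_visible_cpu_clocks(unsigned window, unsigned packets)
{
    unsigned lost = packets * obs_blackout_cpu_clocks();
    assert(lost <= window);
    return window - lost;
}
```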
I.1.2
Dispatch Control Register
The Dispatch Control Register, ASR 0x18, enables some performance features related to instruction dispatch, and controls the output of internal signals to UltraSPARC-IIi SYSADR[14:0] pins for help in chip debug and instrumentation.
Bits | Field
63:6 | (no field name shown)
5:3 | GS
2 | MVX
1 | rsvd
0 | MS
FIGURE I-1 Dispatch Control Register (ASR 0x18)
GS: Group select bits. Selects the group of signals driven out on SYSADR during cycles not used by UPA64S address packets. All unused encodings cause undefined results; zero after POR.
TABLE I-1 Group Select Bits
GS | Group
000 | 0
001 | 1
010 | 2
011 | 3
100 | 4
111 | ALL1
MVX: IEU.movx_enable. Controls a performance enhancement (compared to US-I) for movx instructions. If set, movx instruction dispatch is stopped when there is a valid load instruction in the E-stage; zero after POR.
MS: IEU.multi_scalar. Multi-Scalar Dispatch Control. If cleared, instruction dispatch is forced to a single instruction per group; zero after POR.
Recommended initialization for normal system operation is 0x3D.
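The recommended value can be reproduced from the individual fields. The C sketch below assumes that GS occupies bits [5:3], MVX bit 2, and MS bit 0; that placement is consistent with the three-bit encodings in TABLE I-1 and with 0x3D itself (GS = 111, MVX = 1, rsvd = 0, MS = 1). The enum and function names are hypothetical, and the sketch only computes the value; it does not write the ASR.

```c
#include <assert.h>
#include <stdint.h>

/* Group select encodings from TABLE I-1. */
enum dcr_group {
    DCR_GS_GROUP0 = 0,
    DCR_GS_GROUP1 = 1,
    DCR_GS_GROUP2 = 2,
    DCR_GS_GROUP3 = 3,
    DCR_GS_GROUP4 = 4,
    DCR_GS_ALL1   = 7
};

/* Assumed field positions (see lead-in). */
enum { DCR_MS_SHIFT = 0, DCR_MVX_SHIFT = 2, DCR_GS_SHIFT = 3 };

/* Compose a Dispatch Control Register value; bit 1 (rsvd) stays zero. */
static inline uint64_t dcr_value(enum dcr_group gs, int mvx, int ms)
{
    assert((unsigned)gs <= 4u || gs == DCR_GS_ALL1);  /* other encodings are undefined */
    return ((uint64_t)gs << DCR_GS_SHIFT)
         | ((uint64_t)(mvx != 0) << DCR_MVX_SHIFT)
         | ((uint64_t)(ms != 0) << DCR_MS_SHIFT);
}
```

With these assumptions, dcr_value(DCR_GS_ALL1, 1, 1) gives 0x3D, while keeping the same dispatch settings and selecting group 0 for debug gives 0x05.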
I.1.3
Timing
All signals appear on the pins three stages after they are valid within UltraSPARC-IIi. Each signal is buffered with a rising-edge-triggered D-flip-flop.
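[Figure: an internal signal passing through d/q flip-flop stages and logic before reaching the pins; labels in the figure: signal, logic, obs_tap_bus_N[], Block, SPR, I/O Cell.]
FIGURE I-2
Diagram of Observability Bus Logic.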
I.1.4
Signal List
Groups are divided roughly into:
• Group 0: Primary pipe pins
• Group 1: Program counter
• Group 2: Prefetch unit
• Group 3: Load-store unit, E-cache unit
• Group 4: Special Purpose Register block signals
• ALL1: Bus is driven high at all times
I.1.4.1
Group 0
Primary pipeline signals (default group)
• obs_tap_bus_0[2:0] = num_complete = f(tr.trctrl.trpc.trap_*_ins_comp_w).
The number of instructions completed in W, from zero through four inclusive. Help instructions are counted only once, but they differ in the exact cycle that gets counted because of the way the valid bits behave for different instructions. For example, CASA is counted on W1 of the help==00 cycle, while MULX is counted on W1 of the help==11 cycle.
• obs_tap_bus_0[4:3] = ieu_dispatched_g[3:0] compressed to 2 bits
The number of instructions dispatched into the pipeline by G-logic:
    0x0 == no instructions dispatched
    0x1 == one instruction dispatched
    0x2 == two instructions dispatched
    0x3 == three or four instructions dispatched
• obs_tap_bus_0[5] = lsu_stall_v4_e
Stalls the E-stage of the pipe when an instruction requires data from an earlier load operation that is not yet available. Can happen due to a D$ miss, read-after-write hazard, sign extension on a D$ hit, load buffer not empty, etc.
• obs_tap_bus_0[6] = flop(tr_microtrap_n3 | ieu_flush_n3)
Indicates that a flush or microtrap is being taken. obs_tap_bus_0[6] and obs_tap_bus_0[8] should not be active together, and each should always be followed by bit 7 going active two or more cycles later before either goes active again. Both should be single-cycle pulses.
• obs_tap_bus_0[7] = flop(ieu_done || ieu_retry)
Indicates that trap logic is delivering a PC (and NPC for retries) from which to begin fetching after POR, traps, DONE/RETRY inst flushes, microtraps, etc.
• obs_tap_bus_0[8] = flop(ieu_traptaken_n3)
The trap unit has determined that an N3 instruction should trap, and signals the pipeline to take the trap. obs_tap_bus_0[6] and obs_tap_bus_0[8] should not be active together, and each should always be followed by bit 7 going active two or more cycles later before either goes active again. Both should be single-cycle pulses.
• obs_tap_bus_0[9] = finish_fpop (‘FGC.c_f1_write[0] | fdiv_finish)
A floating point operation has come off the queue.
• obs_tap_bus_0[10] = finish_load (NEEDS FIX IN RTL--LOGIC IN EX)
A floating point operation has come off the queue.
• obs_tap_bus_0[11] = pdu_bad_pred_c
This C-stage signal is asserted when the direction of a conditional branch has been mispredicted or the target address of a register-indirect jump (JMPL or RETURN) has been mispredicted. Note: obs_tap_bus_2[5] (pdu_br_resol_c) should be asserted at the same time.
• obs_tap_bus_0[14:12] = E$ arbitration
    // ecache fills or ownership etag/edata writes
    ((dxfsm_ecache_req & ~dxfsm_ecache_busy) ? 3'd1 : 3'd0) |
    // copybacks or invalidates
    ((snp_ecache_req & ~snp_ecache_busy) ? 3'd2 : 3'd0) |
    // writebacks or block stores
    ((trfsm_ecache_req & ~trfsm_ecache_busy) ? 3'd3 : 3'd0) |
    // data back for noncacheable loads or the sdb data transfer nc stores
    ((nc_ecache_req & ~nc_ecache_busy) ? 3'd4 : 3'd0) |
    // noncacheable or cacheable loads/bloads, asi stores to sdb/ecache
    (ldb_win ? 3'd5 : 3'd0) |
    // noncacheable or cacheable stores/bstores, asi loads to sdb/ecache
    (stb_win ? 3'd6 : 3'd0) |
    // tag checks for stb
    (sttag_win ? 3'd7 : 3'd0);
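As an illustration of how a captured group 0 sample might be unpacked, the C sketch below splits a 15-bit SYSADR[14:0] value into the fields listed above. The structure, field names, and function name are hypothetical, and the sketch assumes the sample was captured with group 0 selected and outside a UPA64S address packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one group 0 sample. */
typedef struct {
    unsigned num_complete;       /* [2:0]   instructions completed in W (0..4)    */
    unsigned dispatched;         /* [4:3]   0, 1, 2, or 3 (= three or four)       */
    bool     lsu_stall;          /* [5]     lsu_stall_v4_e                        */
    bool     flush_or_microtrap; /* [6]                                           */
    bool     pc_delivered;       /* [7]     trap logic delivering a PC            */
    bool     trap_taken;         /* [8]                                           */
    bool     finish_fpop;        /* [9]                                           */
    bool     finish_load;        /* [10]                                          */
    bool     bad_pred;           /* [11]    pdu_bad_pred_c                        */
    unsigned ecache_arb;         /* [14:12] E$ arbitration code; codes are ORed
                                            together if several sources are active */
} obs_group0;

static inline bool obs_bit(uint16_t v, unsigned n)
{
    return ((v >> n) & 1u) != 0;
}

static inline obs_group0 obs_decode_group0(uint16_t sysadr)
{
    obs_group0 g;
    assert((sysadr & 0x8000u) == 0);  /* only SYSADR[14:0] carries data */
    g.num_complete       = sysadr & 0x7u;
    g.dispatched         = (sysadr >> 3) & 0x3u;
    g.lsu_stall          = obs_bit(sysadr, 5);
    g.flush_or_microtrap = obs_bit(sysadr, 6);
    g.pc_delivered       = obs_bit(sysadr, 7);
    g.trap_taken         = obs_bit(sysadr, 8);
    g.finish_fpop        = obs_bit(sysadr, 9);
    g.finish_load        = obs_bit(sysadr, 10);
    g.bad_pred           = obs_bit(sysadr, 11);
    g.ecache_arb         = (sysadr >> 12) & 0x7u;
    return g;
}
```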
I.1.4.2
Group 1
Program counter
• obs_tap_bus_1[11:0] = pc[13:2].
These are bits [13:2] (the word address) of the D-stage “fetch PC” (LSB of the virtual page number + page offset). RTL use: In the D-stage, this PC (bits [43:13]) is being translated by the ITLB. It is also the PC that will be enqueued in the GPCQ (G-stage PC Queue) in the next cycle (when the associated instructions are enqueued in the IBuffer), if this fetch starts a new PC segment.
• obs_tap_bus_1[12] = pfc_utlb_miss
This D-stage signal is asserted when the fetch PC crosses a page boundary (e.g. by jumping to a different page); the prefetcher then stalls one cycle to wait for the ITLB translation.
• obs_tap_bus_1[13] = function of (pfc_va_valid, pfc_cancel_itlb)
When this signal is asserted in the D2 stage, the results (hit/miss/exception and the physical address) of the ITLB translation performed the previous cycle (D stage) are valid and used.
• obs_tap_bus_1[14] = function of (pfc_imu_exc, pfc_imu_miss)
This signal is asserted in the D2 stage (when a uTLB miss has occurred in D, forcing the prefetcher to stall for the ITLB translation) if the VA translation has caused an exception (caused an ITLB miss or an ITLB access exception, or the VA is illegal, that is, in the “hole”). This signal is already qualified by the “cancel” signal, pdu_cancel_itlbt, so that it will not be asserted if the translation will not actually be needed.
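Since obs_tap_bus_1[11:0] carries pc[13:2], the low 14 bits of the fetch PC can be recovered from a group 1 sample by shifting left by two. The C sketch below shows this together with the three flag bits; the structure and names are hypothetical, and the upper PC bits are not observable in this group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one group 1 sample. */
typedef struct {
    uint32_t pc_low;             /* pc[13:0]; bits [1:0] are always zero  */
    bool     utlb_miss;          /* [12] pfc_utlb_miss                    */
    bool     itlb_result_valid;  /* [13] ITLB translation results valid   */
    bool     itlb_exception;     /* [14] translation caused an exception  */
} obs_group1;

static inline obs_group1 obs_decode_group1(uint16_t sysadr)
{
    obs_group1 g;
    assert((sysadr & 0x8000u) == 0);              /* only SYSADR[14:0] carries data */
    g.pc_low            = (uint32_t)(sysadr & 0x0FFFu) << 2;  /* word address to byte address */
    g.utlb_miss         = ((sysadr >> 12) & 1u) != 0;
    g.itlb_result_valid = ((sysadr >> 13) & 1u) != 0;
    g.itlb_exception    = ((sysadr >> 14) & 1u) != 0;
    return g;
}
```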
I.1.4.3
Group 2
Prefetch unit, caches
• obs_tap_bus_2[1:0] = pdu_i*_valid (compressed to 2 bits)
Encoded count of the number of valid instructions in the IBuffer:
    0x0 == no instructions dispatched
    0x1 == one instruction dispatched
    0x2 == two instructions dispatched
    0x3 == three or four instructions dispatched
• obs_tap_bus_2[2] = fetch_stall = pfc_ignore_fetch || ibcm_full || gpcq_qfull
If this D-stage signal is asserted, no instructions will be enqueued in the IBuffer next cycle. It will be asserted if the IBuffer or GPCQ is full, or for prefetch stall events: NFA-PC mismatches, SP mispredictions, uTLB misses, branch mispredictions, or cache stalls (for E-cache accesses, snoops, ASI accesses, or flushes).
• obs_tap_bus_2[3] = pfc_non_fetch
Asserted when the instruction prefetcher is stalled because the I-cache is busy (for an E-cache fetch, a snoop, ASI access, or flush).
• obs_tap_bus_2[4] = pdu_br_taken_c
When obs_tap_bus_2[5] (pdu_br_resol_c) is asserted (i.e. a branch is resolved), this C-stage signal is asserted when a conditional branch (Bicc, BPcc, FBfcc, FBPfcc) is taken.
• obs_tap_bus_2[5] = pdu_br_resol_c
Asserted when a DCTI (Bicc, BPcc, FBfcc, FBPfcc, JMPL, RETURN) reaches the C stage. Note: obs_tap_bus_0[11] (pdu_bad_pred_c) should only be asserted when this signal is asserted. obs_tap_bus_2[4] is only valid when this signal is asserted.
• obs_tap_bus_2[6] = pc.pcgen_ctl.pfc_spmiss_d
This D-stage signal is asserted when a “Set misprediction” (SP miss) occurs (that is, when the instructions were fetched from the wrong bank of the I-cache, so the prefetcher must redo the fetch). This should cause the prefetcher to stall for 2 cycles. Note: as a result, obs_tap_bus_2[2] (fetch stall) should be asserted in the same cycle.
• obs_tap_bus_2[7] = imux_pcmiss_d1_f
This D-stage signal is asserted when there is an NFA-PC mismatch (that is, when the “next fetch address” from the NFRAM, used for the F-stage I-cache fetch, mismatches with the actual fetch PC, so the prefetcher must redo the fetch). This is sometimes referred to as a “PC miss”. The prefetcher should stall for 2 cycles. Note: as a result, obs_tap_bus_2[2] (fetch stall) should be asserted in the same cycle.
• obs_tap_bus_2[8] = ibd_pcrel_taken_d
D-stage decode signal for the instructions from the current I-cache (or E-cache) fetch. Indicates that there is a PC-relative branch in the current fetch that is predicted-taken.
• obs_tap_bus_2[9] = ibd_regbr_d
D-stage decode signal for the instructions from the current I-cache (or E-cache) fetch. Indicates that there is a register-indirect jump (JMPL or RETURN) in the current fetch.
• obs_tap_bus_2[10] = (copy of obs_tap_bus_2[0])
• obs_tap_bus_2[11] = iblock.icc_update_icache
This signal is asserted when the I-cache or NFRAM should be updated for a cache fill (it is a component of the RAM write-enables).
• obs_tap_bus_2[12] = imu_stop
IMU has encountered an exception, and will be suspended until told by the pipeline that the exception has been cleared by the instruction being annulled or flushed as it goes down the pipe, or reaching W stage and causing a trap. The imu_stop is cleared whether the instruction causes a trap or not. If imu_stop is left high and the CPU is hung, check for PDU waiting on a request to the ECU. Otherwise, look for cases of the exception instruction getting annulled or flushed without notifying the IMU.
obs_tap_bus_2[13]= write D$ Active when any byte of D$ is being modified, either from a store or D$ fill. For D$ misses, the D$ and D$ tags are written assuming that the data is a hit in the E$. If there is an E$ miss, the D$ will be updated properly when the data for the E$ miss is returned from the system.
obs_tap_bus_2[14]= lsu_tag2_we D$ tag write enable.
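When reading captured traces, the Group 2 bits listed above can be named with a small script. The following Python sketch is illustrative only: the function names, and the assumption that a sample arrives as an integer whose bit i holds obs_tap_bus_2[i], are not from this manual. Only bits 6 through 14 are named, and the stall check encodes the note that an SP miss or PC miss should coincide with obs_tap_bus_2[2].

```python
# Illustrative decoder for the Group 2 bits listed above (bits 6..14 only).
# Assumption: a sample is an integer whose bit i holds obs_tap_bus_2[i].
GROUP2_BITS = {
    6: "pc.pcgen_ctl.pfc_spmiss_d",
    7: "imux_pcmiss_d1_f",
    8: "ibd_pcrel_taken_d",
    9: "ibd_regbr_d",
    10: "copy of obs_tap_bus_2[0]",
    11: "iblock.icc_update_icache",
    12: "imu_stop",
    13: "write D$",
    14: "lsu_tag2_we",
}


def decode_group2(sample):
    """Return the names of the asserted bits 6..14, lowest bit first."""
    return [name for bit, name in sorted(GROUP2_BITS.items())
            if (sample >> bit) & 1]


def stall_consistent(sample):
    """True unless an SP miss (bit 6) or PC miss (bit 7) appears
    without the fetch stall bit (bit 2) in the same sample."""
    miss = ((sample >> 6) | (sample >> 7)) & 1
    stall = (sample >> 2) & 1
    return (not miss) or bool(stall)
```

A sample for which stall_consistent returns False would be worth a closer look, since the text above expects the fetch stall in the same cycle as either miss.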
I.1.4.4 Group 3
Load-store unit, E$ unit
obs_tap_bus_3[3:0]= Snoop information
{ecu_pd_snoop_req, pdu_busy, ecu_ls_snoop_req, lsu_ec_dcache_busy};
obs_tap_bus_3[7:4]= E$ request/cancel information
If there is a read and it is not one of the following, it is the PDU (cacheable or noncacheable). Block loads and stores that hit the ecache will be distinctive by their OE/WE pattern (incrementing addresses).
{ecu_ls_cancel_all, ecu_pd_cancel_all, ecu_ls_cancel_tag, ecu_ls_clear_tag};
obs_tap_bus_3[8]= enq_n1 Load buffer gets an entry enqueued. Often an n1-stage load cannot return data and must be put on the load buffer.
obs_tap_bus_3[9]= ldb_zero_entries The load buffer is empty.
obs_tap_bus_3[10]= raw_hit_target_n1 The D$ access has hit. This is a “raw” signal and is based on the current state of the D$. It is possible that older loads in the Load Buffer can “adjust” the load/store in n1-stage into either a hit or miss based on how these older loads will change the state of the D$ by bringing in new data/overwriting old data.
obs_tap_bus_3[11]= lsu_use_other lsu_use_other indicates from where load data is returning. If asserted, data is coming from the “other” bus. If deasserted, data is coming directly from the D$. The “other” bus transfers data for:
• D$ misses
• NC loads
• diagnostic loads (load alternates) of external resources (e.g. SDB registers, E$ data RAM, E$ tag RAM)
• loads (again, load alternates) of internal resources (e.g. I$, DMMU, IMMU, D$, ECU internal registers, etc.).
In addition, it also carries data on D$ hits for signed loads (ldsb/ldsba, ldsh/ldsha, ldsw/ldswa) one cycle delayed. If a subsequent load is attempting to return data in the cycle following the signed load’s D$ hit, it is forced to use the “other” bus and to be delayed one cycle as well (this scenario is often referred to as “delayed return mode”).
obs_tap_bus_3[12]= lsu_stb_dec_count An entry is dequeueing from the store buffer. This signal is asserted the cycle after the Store Buffer valid bit is deasserted. For writes to the E$, this is the cycle that the address is being driven from UltraSPARC-IIi to the E$ RAMs.
obs_tap_bus_3[13]= stb_block_ldb_ec_req Store buffer gets priority over the load buffer for E$ request signals. No Load requests to the E$ can be made in this cycle, because the Store Buffer has assumed priority to “drain” as it has hit a “high watermark” in the number of entries it contains.
obs_tap_bus_3[14]= sab_addr_valid[0] Valid bit for store buffer entry 0. (Store buffer is not empty.)
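The Group 3 fields can be unpacked in the same spirit. This is a minimal sketch, assuming that the brace lists follow the usual HDL concatenation convention (leftmost name in the most significant bit) and that a sample is an integer with obs_tap_bus_3[i] in bit i; neither assumption is stated in this manual, and the function names are hypothetical.

```python
# Illustrative decoder for Group 3. Assumed bit order within each brace
# list: leftmost signal name is the most significant bit of the field.
SNOOP = ("ecu_pd_snoop_req", "pdu_busy",
         "ecu_ls_snoop_req", "lsu_ec_dcache_busy")        # bits [3:0]
CANCEL = ("ecu_ls_cancel_all", "ecu_pd_cancel_all",
          "ecu_ls_cancel_tag", "ecu_ls_clear_tag")        # bits [7:4]
SINGLE = {
    8: "enq_n1",
    9: "ldb_zero_entries",
    10: "raw_hit_target_n1",
    11: "lsu_use_other",
    12: "lsu_stb_dec_count",
    13: "stb_block_ldb_ec_req",
    14: "sab_addr_valid[0]",
}


def unpack_concat(value, names):
    """Split a small field into named bits, leftmost name = MSB."""
    width = len(names)
    return {name: (value >> (width - 1 - i)) & 1
            for i, name in enumerate(names)}


def decode_group3(sample):
    """Return a dict of signal name -> 0/1 for one Group 3 sample."""
    fields = unpack_concat(sample & 0xF, SNOOP)
    fields.update(unpack_concat((sample >> 4) & 0xF, CANCEL))
    for bit, name in SINGLE.items():
        fields[name] = (sample >> bit) & 1
    return fields
```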
I.1.4.5 Group 4
Information from EX on CWP state and changes.
obs_tap_bus_4[7:0]= spr_cwpread_g[7:0]
obs_tap_bus_4[10:8]= sprcntl_cwp_muxsel_g[2:0]
obs_tap_bus_4[14:11]= {sprcntl_cwpchange_e, sprcntl_cwpchange_c, sprcntl_cwpchange_n1, sprcntl_cwpchange_n3}
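For Group 4, the three fields can be separated with shifts and masks. The sketch below is an illustration, not part of the manual: it assumes a sample is an integer with obs_tap_bus_4[i] in bit i, and that the brace list places sprcntl_cwpchange_e in bit 14 and sprcntl_cwpchange_n3 in bit 11 (leftmost name in the most significant bit).

```python
def decode_group4(sample):
    """Split one Group 4 sample into its named fields.

    Assumption: bit i of `sample` is obs_tap_bus_4[i], and the
    cwpchange brace list is ordered MSB first.
    """
    change = (sample >> 11) & 0xF
    return {
        "spr_cwpread_g": sample & 0xFF,
        "sprcntl_cwp_muxsel_g": (sample >> 8) & 0x7,
        "sprcntl_cwpchange_e": (change >> 3) & 1,
        "sprcntl_cwpchange_c": (change >> 2) & 1,
        "sprcntl_cwpchange_n1": (change >> 1) & 1,
        "sprcntl_cwpchange_n3": change & 1,
    }
```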
I.1.4.6 ALL1
When this group is chosen the observability bus is driven high at all times. This reduces the power consumption of UltraSPARC-IIi since the pins are not toggling. The CPU and PCI test L5CLK’s are also disabled.
Note – The ALL1 group is not the default group. If this feature is required in the system-level environment, the boot/initialization code must set the GS bits accordingly.
I.1.5 Other UltraSPARC-IIi Debug Features
In addition to the observability bus, the default value of the ECAD (address to the data SRAMs) is pdu_pa[21:4], which is the PDU’s prefetch address.
APPENDIX
J
List of Compatibility Notes
The following text is a list of the compatibility notes that appear throughout the body of this manual. The page number for the original compatibility note in the body of the manual appears at the end of each entry in this list.
Note 1: A read of any addresses labelled “Reserved” above returns zeros, and writes have no effect. 52
Note 2: If Configuration cycles are generated with compressed (E-bit==0) byte or halfword stores, or with random byte enable patterns using the PSTORE instruction, UltraSPARC-IIi does not guarantee that AD[1:0] points to the first byte with a BE asserted. Also, while not addressed by the PCI 2.1 specification, UltraSPARC-IIi can generate multi-databeat configuration reads and writes. 85
Note 3: There are no time out errors during table walk for the UltraSPARC-IIi IOM. 104
Note 4: Bits in the DMA UE AFSR/AFAR are set, and the PA of the TTE entry is saved on Invalid, Protection (IOM miss), and TTE UE errors. This should aid debugging of software errors. If the Protection error had an IOM hit, the translated PA from the IOM is saved instead of the PA of the TTE entry. This may occur if a prior DMA read caused the IOM entry to be installed. 105
Note 5: Prior PCI-based UltraSPARC systems implemented a true LRU scheme. 105
Note 6: The IGN on UltraSPARC-IIi is not programmable, and fixed to 0x1F. 110
Note 7: UltraSPARC-IIi does not send interrupts to any devices. A write to these registers has no effect. 121
Note 8: UltraSPARC-IIi does not send interrupts to any devices. A read of this register always returns zeros. 122
Note 9: UltraSPARC-IIi only supports the interrupt data that were present in prior UltraSPARC-based systems; that is, bits 10:0 (INR) of ASI_SDB_INTR(0). All other bits are read as 0. 123
Note 10: Prior UltraSPARCs may have provided the first two registers at the same time. If code depends upon this unsupported behavior it must be modified for UltraSPARC-IIi. 175
Note 11: When the processor is reset, UPA64S, PCI, and APB are also reset. 180
Note 12: Referenced and Modified bits are maintained by software. The Global, Privileged, and Writable fields replace the 3-bit ACC field of the SPARC-V8 Reference MMU Page Translation Entry. 208
Note 13: The UltraSPARC-IIi MMU performs no hardware table walking. The MMU hardware never directly reads or writes to the TSB. 211
Note 14: The single context register of the SPARC-V8 Reference MMU has been replaced in UltraSPARC-IIi by the three context registers shown in Figures 15-4, 15-5, and 15-6. 223
Note 15: In UltraSPARC-IIi the virtual address is longer than the physical address; thus, there is no need to use multiple ASIs to fill in the high-order physical address bits, as is done in SPARC-V8 machines. 234
Note 16: UltraSPARC automatically caused the reset through the UPA. UltraSPARC-IIi currently does not cause an automatic reset. 240
Note 17: If an E-cache data parity error occurs during a write-back, uncorrectable ECC is not forced to memory. However, the error information is logged in the AFSR and a disrupting data_access_error trap is generated. 244
Note 18: If PER is disabled, UltraSPARC-IIi does not set DPE if it detects a parity error on PIO reads. This is inconsistent with the PCI 2.1 spec. 245
Note 19: If PER is disabled, UltraSPARC-IIi does not set DPE if it detects a parity error on DMA writes. This is inconsistent with the PCI 2.1 spec. 246
Note 20: A new feature for UltraSPARC-IIi is that the VA of the offending DMA access is logged in the PCI DMA UE AFSR and AFAR, with a bit set for identification as a DMA translation error. 247
Note 21: UltraSPARC-IIi does not Target Abort on a parity error resulting from a DMA read of E-cache. UltraSPARC caused a UE at the receiver of the data. Currently it is only reported with the same priority/trap as WP (but CP bit set). 255
Note 22: UltraSPARC-IIi causes a Deferred Trap similarly to UltraSPARC for ETS, without a system reset. Software can determine if a system reset is necessary. 255
Note 23: The SDB name is inherited from UltraSPARC. It logs information about memory errors caused by the CPU core. Only the SDBH register is used. Current Solaris software interrogates if SDBL is non-zero, and ORs in a 1 to the logged pa[3] (which is always zero on UltraSPARC, but valid on UltraSPARC-IIi). 255
Note 24: There is no Wakeup Reset support for power management, unlike that in prior UltraSPARC-based systems. 265
Note 25: Prior UltraSPARC Systems used other means for controlling these functions. 277
Note 26: APB has a similar additional state for each of its PCI busses. See the APB User’s Manual for details. 293
Note 27: This device ID is different from that of prior PCI-based UltraSPARC systems. 302
Note 28: A value of 0 means there is no latency timeout. 305
Note 29: ERR and ERRSTS are not present in prior PCI-based UltraSPARC systems. 309
Note 30: Unlike prior PCI-based UltraSPARC systems, UltraSPARC-IIi arbitrates between IOMMU CSR access and DMA access. This property may allow software more flexibility. 312
Note 31: The Used bit does not exist in prior PCI-based UltraSPARC systems, and is used by the pseudo-LRU replacement algorithm. 312
Note 32: The IGN on UltraSPARC-IIi is not programmable for the Partial Interrupt Mapping Registers, and is fixed to 0x1f. 314
Note 33: There is no RECEIVED state for DMA CE, DMA UE, or PCI Error Interrupts. They cause their interrupt FSMs to go from the IDLE to the PENDING state directly, when present and enabled. 316
Note 34: Note the “Graphics Int State” and “Expansion UPA64S Int State” bits are moved from bits 38 and 39 (position in prior UltraSPARC systems) to bits 34 and 35 respectively. 322
Note 35: The UltraSPARC-IIi PCI bus is hardwired to Bus Number == 0. 326
Note 36: UltraSPARC-IIi aliases Functions 1-7 of its PCI Configuration space to its Function 0 PCI Configuration space. (Bus 0, Device 0). The PCI specification requires that zeros be returned and stores ignored. Since this address space is only accessible to UltraSPARC-IIi PIO instructions, specifically boot PROM code, this aliasing should not be problematic because the boot PROM should never reference the UltraSPARC-IIi Function 1-7 addresses (see “Type 0 Configuration Address Mapping” on page 325 for the address decode scheme). 326
Note 37: Unlike prior PCI-based UltraSPARC systems, UltraSPARC-IIi does not use bit 31 of the PCI address for outgoing memory transactions, or bit 17 for outgoing IO transactions. APB also similarly preserves bits 31 and 17. 327
Note 38: Unlike prior PCI-based UltraSPARC systems, Pass-through does not zero PCI_Addr[31]. 329
Note 39: Prior PCI-based UltraSPARC systems used PCI_Addr, but note that [40:34] are all 1’s for UPA64S addresses. 330
Note 40: A PCI DMA UE interrupt is generated whenever a primary DMA UE or Translation Error bit is set, even if by a CSR write. Ensure that software clears the AFSR before clearing the interrupt state and re-enabling the PCI Error Interrupt. (This behavior is similar to that of the ECU AFSR). 331
Note 41: This feature is absent in prior PCI-based UltraSPARC systems but should be compatible with existing Solaris code. 332
Note 42: A DMA CE interrupt is generated whenever a primary DMA CE bit is set, even if by a CSR write. Ensure that software clears the AFSR before it clears the interrupt state and re-enables the PCI Error Interrupt. (This behavior is similar to that of the ECU AFSR). 334
Note 43: Because of the smaller external cache data and tag, some adjustments are made to these diagnostic accesses. 394
APPENDIX
K
Errata
K.1 Overview
This document contains a list of errata for version 1.2 and above of the UltraSPARC-IIi CPU.
K.2 Errata Created by UltraSPARC-I
Erratum 32:
Load from ITLB or DTLB may return wrong data if the load is after a store instruction to ITLB or DTLB that traps.
The following is required to occur:
• Store to ASIs ASI_ITLB_DATA_ACCESS_REG or ASI_DTLB_DATA_ACCESS_REG (ITLB or DTLB entries) traps.
• Load from ASIs ASI_ITLB_DATA_ACCESS_REG or ASI_DTLB_DATA_ACCESS_REG (ITLB or DTLB entries).
• No intervening store instructions between the above Store and Load.
For example:
stx %reg,[..]ASI    ;if this instruction traps for some reason
                    ;ASI for ITLB 0x55 and for DTLB 0x5d
....                ;the instructions dispatched following store
                    ;does not contain any st or st to alternate
                    ;space instruction
ldx [..]ASI %reg    ;Reads TLB entry ASIs 0x55, 0x56 (for ITLB)
                    ;ASI 0x5d, 0x5e (for DTLB)
In the IMU/DMU, the address of the internal register to be written by a store is latched after the store is dispatched. A wait state is entered until the time the data is actually written. If this instruction traps, the control logic does not reset and remains in this wait state. A subsequent load from TLB entries can be corrupted by this wait state, resulting in the use of the internal address associated with the prior store instead of that from the load. However, this wait state is cleared by any store instruction. Hence the problem does not exist if a store is executed between the trapping store and the load.
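The mechanism described above can be pictured with a toy software model. The sketch below is not a description of the actual control logic: the class, its method names, and the use of a dictionary for the internal registers are all hypothetical, and it only mirrors the three behaviors stated in the text (the address is latched at dispatch, a trap leaves the latch in place, and any later store clears it).

```python
class TlbPortModel:
    """Toy model of the erratum 32 wait state (illustrative only)."""

    def __init__(self, regs):
        self.regs = dict(regs)      # internal register address -> contents
        self.latched_addr = None    # address latched by a dispatched store

    def store(self, addr, value, traps=False):
        """Store to an internal register; optionally trap before writing."""
        self.latched_addr = addr    # latched after dispatch, wait state begins
        if traps:
            return                  # erratum: wait state is not reset
        self.regs[addr] = value
        self.latched_addr = None    # data written, wait state ends

    def any_store(self):
        """Any store instruction, to any address space, clears the wait state."""
        self.latched_addr = None

    def load(self, addr):
        """Load uses the stale latched address if the wait state persists."""
        used = addr if self.latched_addr is None else self.latched_addr
        return self.regs[used]
```

In this model a load that follows a trapping store returns the contents of the store's register rather than its own, and inserting any store between them restores the expected result, which is the shape of the workaround below.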
Software workaround: Add a Store instruction to any address space before loads from ITLB or DTLB, if none already exists.
Erratum 45:
DONE/RETRY/SAVED/RESTORED with illegal fcn field executed in nonprivileged mode take privileged_opcode trap rather than illegal_instruction trap.
The following instruction conditions generate a privileged_opcode trap rather than the specified illegal_instruction trap.
DONE for fcn = 2..31 executed in nonprivileged mode
RETRY for fcn = 2..31 executed in nonprivileged mode
SAVED for fcn = 2..31 executed in nonprivileged mode
RESTORED for fcn = 2..31 executed in nonprivileged mode
Software workaround: The opcode can be recognized by software to emulate the proper illegal_instruction behavior. This can be done with SPARC code in the privileged_opcode trap handler that does the following:
PRIVILEGED_OPCODE_HANDLER:
    rdpr    %tpc, %g1
    ld      [%g1], %g2
    setx    0xc1f80000, %g3, %g4
    and     %g4, %g2, %g4           ! %g4 has op/op3 of trapping instr.
    setx    0x3e000000, %g3, %g6
    and     %g6, %g2, %g6
    srl     %g6, 25, %g6            ! %g6 has fcn of trapping instr.
check_illegal_saved_restored:
    setx    0x81880000, %g3, %g5
    subcc   %g4, %g5, %g0           ! saved/restored opcode?
    bne     check_illegal_done_retry
    subcc   %g6, 2, %g0             ! illegal fcn value?
    bge     ILLEGAL_HANDLER
    nop
check_illegal_done_retry:
    setx    0x81f00000, %g3, %g5
    subcc   %g4, %g5, %g0           ! done/retry opcode?
    bne     not_illegal
    subcc   %g6, 2, %g0             ! illegal fcn value?
    bge     ILLEGAL_HANDLER
    nop
not_illegal:
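The decode performed by the handler above can be checked off-line against sample instruction words. The following Python sketch reuses the same three masks and two opcode patterns from the handler; the function name is hypothetical and the sketch only classifies a 32-bit word, it does not emulate the trap.

```python
OP_OP3_MASK = 0xC1F80000      # op (bits 31:30) and op3 (bits 24:19)
FCN_MASK = 0x3E000000         # fcn (bits 29:25)
SAVED_RESTORED = 0x81880000   # op/op3 pattern tested first by the handler
DONE_RETRY = 0x81F00000       # op/op3 pattern tested second by the handler


def should_be_illegal(instr):
    """True if the handler above would branch to ILLEGAL_HANDLER."""
    op_op3 = instr & OP_OP3_MASK
    fcn = (instr & FCN_MASK) >> 25
    return op_op3 in (SAVED_RESTORED, DONE_RETRY) and fcn >= 2
```

For example, a DONE/RETRY word with fcn = 2 is classified as illegal, while the same word with fcn = 0 or 1 is not.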
Erratum 47:
JMPL instruction at boundary of Virtual address hole sign-extends %rd.
Virtual addresses between:
0x0000 0800 0000 0000 and 0xFFFF F7FF FFFF FFFF
inclusive, are termed out of range. This range is referred to as the Virtual address hole and is described in Section 4.2, “Virtual Address Translation” on page 23; also see Section 14.1.7, “44-bit Virtual Address Space” on page 184. The following instruction sequence causes %rd to be loaded with the wrong value:
pc = 0x000007FF.FFFFFFFC    jmpl address, %rd
pc = 0x00000800.00000000
The %rd is saved as: 0xFFFF F800 0000 0000, when it should be the first address in the Virtual address hole: 0x0000 0800 0000 0000. The failure would be that an erroneous jmpl at the boundary (which should trap if the correct return address were used) would create a valid instead of invalid return address. This valid return address would not trap as a “VA hole” PC.
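The two values quoted above differ exactly by a sign extension from bit 43. The sketch below illustrates this under the assumption that the extension is taken from the top bit of the 44-bit virtual address; the helper names are hypothetical, and the hole bounds are the ones given in this erratum.

```python
MASK64 = (1 << 64) - 1
HOLE_START = 0x0000_0800_0000_0000   # first address in the VA hole
HOLE_END = 0xFFFF_F7FF_FFFF_FFFF     # last address in the VA hole


def in_va_hole(va):
    """True if the 64-bit address lies in the out-of-range region."""
    return HOLE_START <= (va & MASK64) <= HOLE_END


def sign_extend_44(va):
    """Sign-extend bit 43 of the address into bits 63:44."""
    low = va & ((1 << 44) - 1)
    if low & (1 << 43):
        low |= MASK64 & ~((1 << 44) - 1)
    return low
```

Applied to 0x0000 0800 0000 0000, sign_extend_44 produces 0xFFFF F800 0000 0000: the first value is inside the hole and the second is not, which is why the saved return address no longer traps.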
Software workaround: US-I errata require the OS to not map the 4 GB of instruction space immediately above and below the VA hole, so the OS would not map the following 4 GB ranges:
lower range: 0x0000 0700 0000 0000 to 0x0000 07FF FFFF FFFF
upper range: 0xFFFF F800 0000 0000 to 0xFFFF F8FF FFFF FFFF
Since the instruction address at the boundary is never mapped, a valid instruction is never executed at that PC.
Erratum 48:
DONE/RETRY with TL=0 causes a privileged rather than an illegal instruction trap. The SPARC Architecture Manual, Version 9 says an illegal instruction trap should be taken. Instead, a privileged trap is taken.
Erratum 49:
ASIs 0x5c/5d/5e all cause ft[2] in the DMMU SFSR to be set according to the tlb entry.
The UltraSPARC-I/II User’s Manual says that the ft[2] bit of the D-MMU Synchronous Fault Status Register (loaded on traps) is set for Atomics (including 128-bit atomic load) to pages marked uncacheable, and that the bit is zero for internal ASI accesses, except for atomics to DTLB_DATA_ACCESS_REG (0x5D), which update according to the TLB entry accessed. (See Section 15.4.4, “Data_access_exception Trap” on page 212 and TABLE 15-13 on page 224). The correction to the documentation is that all ASIs which access the D-MMU tlb have the same behavior, that is:
0x5C    ASI_DTLB_DATA_IN_REG
0x5D    ASI_DTLB_DATA_ACCESS_REG
0x5E    ASI_DTLB_TAG_READ_REG
For instance, swapa [%g0] 0x5e, %g0 traps with ft[3:0] = 1000, if the mapping for VA==0x0 has cp==1 and cv==1.
Erratum 50:
RDPR of TPC, TNPC, or TSTATE may not bypass correctly into arithmetic instructions that create condition codes, causing incorrect V/C bypass/use. (Z and N are apparently always correct.)
The discovered failing instruction sequence is:
rdpr %tpc, %i0
subcc %i0, %g2, %i3
The 65th bit of the ALU used in the 2nd instruction can be incorrect. This should only affect the setting of the V and C flags by that instruction. It may also affect an integer divide that uses the result of the rdpr. The code above might be used when software is checking for a range of PC values and uses the V or C flag to do a less-than, greater-than comparison. The problem may exist for rdpr’s of other trap state.
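To make the role of the two affected flags concrete, the sketch below computes the 64-bit condition codes that a correct subcc produces, following the usual SPARC-V9 definitions (C is the borrow of the subtraction, V is signed overflow). It is an illustration of what a PC range check relies on, not a model of the faulty bypass; the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def subcc_xcc(a, b):
    """Return (result, flags) for a 64-bit subcc a - b."""
    a &= MASK64
    b &= MASK64
    res = (a - b) & MASK64
    flags = {
        "N": res >> 63,                          # result is negative
        "Z": int(res == 0),                      # result is zero
        "V": ((a ^ b) & (a ^ res)) >> 63,        # signed overflow
        "C": int(a < b),                         # borrow (unsigned a < b)
    }
    return res, flags
```

An unsigned less-than test of a PC against a bound reads C, and a signed test combines N and V, so an incorrect V or C would send such a comparison down the wrong branch even though the subtraction result itself is unaffected.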
The problem occurs on instructions that use the first-level shortloop into the diad 65-bit ALU on operands whose results are generated from the iexe_aludp1_aluout_65_e bus. On second-level and later conflicts, the 65th bit was stripped off and shortlooped back in as zero. Only the first-level shortloop allows a one on bit 64 to be shortlooped back into a following instruction. The 65th bit can only be one either when information is read in from the trap_sr_e busses and sign-extended into the 65th bit, or for a shift operation.
There is a family of failures that can occur on any instruction following, and using the results of, a preceding instruction’s usage of the trap_sr_e results bus. The full range of rdpr/rdasr that could be of interest can be examined for non-zero bit 63 (floating-point state excluded):
rdpr of: TPC, TNPC, TSTATE, TT, TICK, TBA, PSTATE, TL, PIL, CWP, CANSAVE, CANRESTORE, CLEANWIN, OTHERWIN, WSTATE, and VER
rdasr of: Y_REG, COND_CODE_REG, ASI_REG, TICK_REG, PERF_CONTROL_REG, PERF_COUNTER, DISPATCH_CONTROL_REG, GRAPHIC_STATUS_REG, SOFTINT_REG, TICK_CMPR_REG
Since the MSB needs to be 1, not all of the above registers can cause the error (some have bit 63 defined to be always zero), so apparently only rdpr of TPC, TNPC, TSTATE, and TICK, and rdasr of TICK_REG and PERF_COUNTER, can cause this error. It appears further that only reads from trap state are involved, that is, TPC, TNPC, or TSTATE.
Software workaround: Inhibit use of this bypass path by feeding the result of the rdpr through another operation before executing an instruction on it that sets condition codes or performs an integer divide. That is, the example at the top could become:
rdpr %tpc, %i0
mov %i0,%i0
subcc %i0, %g2, %i3
Erratum 51:
IMU miss, with mispredicted CTI and delayed issue of delay slot, can cause instruction issue to stop.
US-I, II, and IIi can stop issuing instructions (but be interruptible by XIR, and possibly other enabled trap conditions) due to a condition created, in one case, by this instruction sequence in an older Solaris interrupt trap handler:
STXA            using ASI in the range 0x46-0x5f, 0x76 or 0x77 (possibly any store)
JMPL
MEMBAR #Sync
Apparently, the deadlock is most easily caused if the delay slot of the JMPL is a MEMBAR #Sync, or any instruction that synchronizes on the load or store buffers being empty. It appears that a delayed issue of the delay slot instruction is required, with the delay being probably 8 cycles or more after the CTI instruction.
The relevant part of all this is just causing the delay slot instruction issue to be delayed, in the presence of a mispredicted branch (the JMPL is mispredicted the first time it is installed into the I-cache). So there are more scenarios possible than those described. The “delayed issue” requirement apparently does not include “delayed due to fetching the delay slot instruction”.
It may also be possible to create the condition if the JMPL is replaced by other control transfer instructions, for example, CALL or RETURN or possibly any CTI. However, they must be mispredicted. There are a number of other conditions related to hits on I-cache state that are also apparently required.
The easiest way to get an IMU miss, for typical code execution scenarios, is when using a predicted VA from the Return Address Stack (RAS). This appears to be why the JMPL sequence exposes the problem. Also, it appears that the predicted information for the target may need to be a pc-relative branch, and that the predicted information may need to be marked invalid in the I-cache predecode RAM. Note that the VAs in question are all predicted, and the combination of the predicted VA from the RAS, and a predicted branch displacement may result in a VA that is never mapped, rather than just temporarily in the IMU.
Since it is possible to trap out of this deadlock, it can only be detected as a performance loss, except when pstate.ie==0 and timer interrupts cannot occur (for instance, in trap handlers).
Software workaround: Any code that
• turns off pstate.ie, that is, disables timer interrupts, or
• is very performance sensitive and which carries the possibility of mispredicted JMPL or branches with delay slots whose issue can be delayed (there are many cases; note that “delayed because not fetched yet” must also be included)
must guarantee: No IMU miss on any predicted path for the prefetch PCs. This must be true for all behaviors of the RAS and the NFRAM, in generating predicted PCs, which may not reflect real execution. For the OS, this amounts to requiring the RAS be initialized with CALLs to its known IMU-hitting VA space, specifically, CALLs that have return PCs 4 G-bytes away from the boundary of its IMU-hit VA space. The 4 G-bytes requirement helps ensure that predicted JMP targets are still within the IMU-hitting VA space. Note that CALL instructions push onto the RAS before being issued, so it is possible for unexpected VAs to appear on the RAS, owing to predicted CALLs pointing to old I-cache pre-decode information. Note that user code can still cause this IMU stop scenario. Since it is interruptible, execution resumes at the next interrupt (or, in the worst case, at the time slice), and the stop is not detected.
Erratum 53:
Little-endian enabled integer LDD/STD do not register swap.
This applies to pages with the IE bit set in the TSB entry for that page, or to ldda/stda used with any of the "LITTLE" ASIs, that is:
ASI_AS_IF_USER_PRIMARY_LITTLE
ASI_AS_IF_USER_SECONDARY_LITTLE
ASI_NUCLEUS_LITTLE
ASI_PRIMARY_LITTLE
ASI_SECONDARY_LITTLE
ASI_SECONDARY_NOFAULT_LITTLE
The V9 architecture requirement is given in Section 6.3.1.2.2, “Little-Endian Addressing Convention” on pages 69-70 of The SPARC Architecture Manual, Version 9:
doubleword or extended word: For the deprecated integer load/store double instructions (LDD/STD), two little-endian words are accessed. The word at the address specified in the instruction + 4 corresponds to the even register specified in the instruction. The word at the address specified in the instruction corresponds to the following odd-numbered register.
Instead of this requirement, US-I, II and IIi link the word address specified in the instruction to the even register, always. The word address plus 4 is linked to the odd register always.
Note that sections A27 and A53 of The SPARC Architecture Manual, Version 9 describe the LDD/STD instructions as behaving similarly. Use the descriptions in section 6.3.1.2.2 of the Architecture manual for the exclusion for little-endian behavior.
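The difference between the V9 requirement and the implemented behavior for a little-endian LDD can be written out as a short model. This is a sketch for illustration only: the function and its v9_swap parameter are hypothetical, memory is represented as a Python bytes object, and the model covers only which word is linked to which register of the even/odd pair.

```python
import struct


def ldd_little(mem, addr, v9_swap):
    """Return (even_reg, odd_reg) for a little-endian LDD at addr.

    v9_swap=True  : V9 requirement (word at addr+4 -> even register).
    v9_swap=False : US-I, II and IIi (word at addr -> even register).
    """
    word_lo = struct.unpack_from("<I", mem, addr)[0]       # word at addr
    word_hi = struct.unpack_from("<I", mem, addr + 4)[0]   # word at addr + 4
    return (word_hi, word_lo) if v9_swap else (word_lo, word_hi)
```

Comparing the two return values for the same memory image shows that the register pair is exchanged, while the byte order within each word is the same in both cases.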
K.3 Errata created by UltraSPARC-IIi
Erratum 1171:
Noncacheable load/store using PA[40:0] that maps to the unused PBM PCI Configuration Space (function!=0) can result in a deadlock.
There are two situations:
• The first is an “illegal” case. Noncacheable load/store with PA[40:0] in the range 0x1FE.0100.0100–0x1FE.0100.07FF, and the ASI is 0x77 or 0x7F (SDB CSRs). Note that these PAs are unspecified in this manual. Normally, unspecified addresses like this can alias to other CSRs (see Section 19.4.3, “DMA Error Registers” on page 330), but in this case a deadlock may occur.
• The second case is a noncacheable load or store to the range 0x1FE.0100.0100–0x1FE.0100.07FF. This is the PBM’s PCI configuration space, for function!=0. The PBM has no valid CSRs for nonzero function ID.
The 2.1 PCI spec says that references to any unused configuration space should be a no-op.
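Boot or diagnostic code that generates configuration addresses programmatically could screen them against the range named in this erratum. The sketch below is a hypothetical helper, not part of any Sun software: the range bounds are the ones given above, and the derivation of the function number from bits 10:8 is an assumption inferred from the range (offsets 0x100 through 0x7FF), not a statement of the address decode scheme.

```python
PA_MASK = (1 << 41) - 1              # PA[40:0]
RANGE_LO = 0x1FE_0100_0100           # 0x1FE.0100.0100
RANGE_HI = 0x1FE_0100_07FF           # 0x1FE.0100.07FF


def hits_unused_pbm_config(pa):
    """True if PA[40:0] falls in the range named by erratum 1171."""
    return RANGE_LO <= (pa & PA_MASK) <= RANGE_HI


def assumed_function_number(pa):
    """Bits 10:8 of the PA (assumed to select the PCI function)."""
    return (pa >> 8) & 0x7
```

Under that assumption, every address in the flagged range has a nonzero function number, matching the "function!=0" wording of the erratum.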
Glossary
This glossary defines some important words and acronyms used throughout this manual. Italicized words within definitions are further defined elsewhere in the list.
alias
    Two virtual addresses are aliases of each other if they refer to the same physical address.
ASI
    Abbreviation for Address Space Identifier.
clean window
    A clean register window is one in which all of the registers contain either zero or a valid address from the current address space or valid data from the current address space.
coherence
    A set of protocols guaranteeing that all memory accesses are globally visible to all caches on a shared-memory bus.
consistency
    See coherence.
context
    A set of translations used to support a particular address space. See also MMU.
copyback
    The process of copying back a cache line in response to a hit while snooping.
CPI
    Cycles per instruction. The number of clock cycles it takes to execute one instruction.
current window
    The block of 24 r registers to which the Current Window Pointer (CWP) register points.
demap
    To invalidate a mapping in the MMU.
dispatch
    To issue a fetched instruction to one or more functional units for execution.
DMA
    Accesses by a master on the secondary bus to a target on the primary bus. Equivalent to upstream.
fccN
    One of the floating-point condition code fields fcc0, fcc1, fcc2, or fcc3.
floating-point exception
    An exception that occurs during the execution of an FPop instruction while the corresponding bit in FSR.TEM is set to 1. The exceptions are: unfinished_FPop, unimplemented_FPop, sequence_error, hardware_error, invalid_fp_register, and IEEE_754_exception.
floating-point IEEE-754 exception
    A floating-point exception, as specified by IEEE Std 754-1985.
floating-point trap type
    The specific type of a floating-point exception, encoded in the FSR.ftt field.
implementation-dependent
    An aspect of the architecture that may legitimately vary among implementations. In many cases, the permitted range of variation is specified in the SPARC-V9 standard. When a range is specified, compliant implementations shall not deviate from that range.
ISA
    instruction set architecture: an ISA defines instructions, registers, instruction and data memory, the effect of executed instructions on the registers and memory, and an algorithm for controlling instruction execution. An ISA does not define clock cycle times, cycles per instruction, data paths, etc.
may
    A key word indicating flexibility of choice with no implied preference.
MMU
    Memory Management Unit: a mechanism that implements a policy for address translation and protection among contexts. See also virtual address, physical address, and context.
module
    A master or slave device that attaches to the shared-memory bus.
next program counter (nPC)
    A register that contains the address of the instruction to be executed next, if a trap does not occur.
non-privileged
    An adjective that describes (1) the state of the processor when PSTATE.PRIV = 0, i.e., non-privileged mode; (2) processor state that is accessible to software while the processor is in either privileged mode or non-privileged mode; e.g., non-privileged registers, non-privileged ASRs, or, in general, nonprivileged state; (3) an instruction that can be executed when the processor is in either privileged mode or non-privileged mode.
non-privileged mode
    The mode in which processor is operating when PSTATE.PRIV = 0. See also privileged.
NWINDOWS
    The number of register windows present in a particular implementation.
optional
    A feature not required for SPARC-V9 compliance.
PCI
    Peripheral Component Interconnect (bus). A high-performance 32 or 64-bit bus with multiplexed address and data lines.
physical address
    An address that maps real physical memory or I/O device space. See also virtual address.
PIO
    Accesses by a master on the primary bus to a target on the secondary bus. Equivalent to downstream.
prefetchable
    A memory location for which the system designer has determined that no undesirable effects will occur if a PREFETCH operation to that location is allowed to succeed. Typically, normal memory is prefetchable. Non-prefetchable locations include those that, when read, change state or cause external events to occur. For example, some I/O devices are designed with registers that clear on read; others have registers that initiate operations when read. See side effect.
privileged
    An adjective that describes (1) the state of the processor when PSTATE.PRIV = 1, that is, privileged mode; (2) processor state that is only accessible to software while the processor is in privileged mode; e.g., privileged registers, privileged ASRs, or, in general, privileged state; (3) an instruction that can be executed only when the processor is in privileged mode.
privileged mode
    The processor is operating in privileged mode when PSTATE.PRIV = 1.
program counter (PC)
    A register that contains the address of the instruction currently being executed by the IU.
RED_state
    Reset, Error, and Debug state. The processor is operating in RED_state when PSTATE.RED = 1.
restricted
    An adjective used to describe an address space identifier (ASI) that may be accessed only while the processor is operating in privileged mode.
reserved
    Used to describe an instruction field, certain bit combinations within an instruction field, or a register field that is reserved for definition by future versions of the architecture. A reserved field should only be written to zero by software. A reserved register field should read as zero in hardware; software intended to run on future versions of SPARC-V9 should not assume that the field will read as zero or any other particular value. Throughout this document, figures illustrating registers and instruction encodings always indicate reserved fields with an em dash ‘—’.
reset trap
    A vectored transfer of control to privileged software through a fixed-address reset trap table. Reset traps cause entry into RED_state.
rs1, rs2, rd
    The integer register operands of an instruction. rs1 and rs2 are the source registers; rd is the destination register.
shall
    A key word indicating a mandatory requirement. Designers shall implement all such mandatory requirements to ensure inter-operability with other SPARC-V9-conformant products. The key word “must” is used interchangeably with the key word shall.
should
A key word indicating flexibility of choice with a strongly preferred implementation. The phrase “it is recommended” is used interchangeably with the key word should.
side effect
A memory location is deemed to have side effects if additional actions beyond the reading or writing of data may occur when a memory operation on that location is allowed to succeed. Locations with side effects include those that, when accessed, change state or cause external events to occur. For example, some I/O devices contain registers that clear on read; others have registers that initiate operations when read.
snooping
The process of maintaining coherency between caches in a shared-memory bus architecture. All cache controllers monitor (snoop) the bus to determine whether they have a copy of a shared cache block.
speculative load
A load operation (e.g., non-faulting load) that is carried out before it is known whether the result of the operation is required. These accesses typically are used to speed program execution. An implementation, through a combination of hardware and system software, must nullify speculative loads on memory locations that have side effects; otherwise, such accesses produce unpredictable results.
supervisor software
Software that executes when the processor is in privileged mode.
TLB
Translation Lookaside Buffer: A hardware cache located within the MMU, which contains copies of recently used translations. Technically, there are separate TLBs for the instruction and data paths; the I-MMU contains the iTLB and the D-MMU the dTLB.
TLB hit
The desired translation is present in the on-chip TLB.
TLB miss
The desired translation is not present in the on-chip TLB.
trap
A vectored transfer of control to supervisor software through a table, the address of which is specified by the privileged Trap Base Address (TBA) register.
unassigned
A value (for example, an ASI number), the semantics of which are not architecturally mandated and which may be determined independently by each implementation (preferably within any guidelines given).
undefined
An aspect of the architecture that has deliberately been left unspecified. Software should have no expectation of, nor make any assumptions about, an undefined feature or behavior. Use of such a feature may deliver random results, may or may not cause a trap, may vary among implementations, and may vary with time on a given implementation.
unimplemented
An architectural feature that is not directly executed in hardware because it is optional or is emulated in software.
unpredictable
Synonymous with undefined.
UltraSPARC-IIi User’s Manual • October 1997
unrestricted
An adjective used to describe an address space identifier (ASI) that may be used regardless of the processor mode; that is, regardless of the value of PSTATE.PRIV.
virtual address
An address produced by a processor that maps all system-wide, program-visible memory. Virtual addresses usually are translated by a combination of hardware and software to physical addresses, which can be used to access physical memory.
writeback
The process of writing a dirty cache line back to memory before it is refilled.
Bibliography
General References
Books and Specifications
Weaver, David L., editor. The SPARC Architecture Manual, Version 8, Prentice-Hall, Inc., 1992.
Weaver, David L., and Tom Germond, editors. The SPARC Architecture Manual, Version 9, Prentice-Hall, Inc., 1994.
Institute of Electrical and Electronics Engineers (IEEE) 1985. IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std 754-1985. New York: IEEE.
Institute of Electrical and Electronics Engineers (IEEE) 1990. IEEE Std 1149.1-1990, IEEE Standard Test Access Port and Boundary-Scan Architecture. New York: IEEE.
PCI Special Interest Group. April 1994. PCI Local Bus Specification, Revision 2.1. Portland, Oregon: PCI Special Interest Group.
Papers
Boney, Joel. “SPARC Version 9 Points the Way to the Next Generation RISC,” SunWorld, October 1992, pp. 100-105.
Greenley, D., et al., “UltraSPARC™: The Next Generation Superstar 64-bit SPARC,” 40th Annual CompCon, 1995.
Kaneda, Shigeo. “A Class of Odd-Weight-Column SEC-DED-SbED Codes for Memory System Applications.” IEEE Transactions on Computers, August 1984.
Kohn, L., et al., “The Visual Instruction Set (VIS) in UltraSPARC™,” 40th Annual CompCon, 1995.
Tremblay, Marc. “A Fast and Flexible Performance Simulator for Microarchitecture Trade-off Analysis on UltraSPARC,” DAC 95 Proceedings.
Zhou, C., et al., “MPEG Video Decoding with UltraSPARC Visual Instruction Set,” 40th Annual CompCon, 1995.
Sun Microelectronics Publications
These books and papers are available in printed form, and some are also available through the World Wide Web (WWW). See “On Line Resources” below for information about the SME WWW pages.
Data Sheets
UltraSPARC-IIi Highly Integrated 64-bit RISC Processor, PCI Interface, SME1040: 805-0086-02
UltraSPARC-IIi Advanced PCI Bridge (APB™), SME2411: 805-0088-02
User’s Guides
UltraSPARC User’s Manual: 802-7220-01
UltraSPARC-I Reset/Interrupt/Clock Controller User’s Manual: 805-0167-01
Other Materials
UltraSPARC Nested Trap White Paper (STB0045)
UltraSPARC Evaluating Processor Performance White Paper (STB0014)
UltraSPARC-II Advanced Branch Prediction and Single Cycle Following White Paper (STB0023)
UltraSPARC-II Advanced Memory Structure White Paper (STB0022)
UltraSPARC-II White Paper (STB0114)
UltraSPARC-II Prefetch White Paper (STB0116)
UltraSPARC-II Multiple Outstanding Requests White Paper (STB0117)
How to Contact
Sun Microelectronics is a division of:
Sun Microsystems, Inc.
901 San Antonio Road
Palo Alto, CA, U.S.A. 94303
Tel: 800 681-8845
On Line Resources
The Sun Microelectronics World Wide Web page is located at: http://www.sun.com/microelectronics
It contains the latest information about the entire UltraSPARC-IIi product line, and may be used to download HTML, PostScript, or Acrobat PDF copies of the IIi data sheets.
Index
NUMERICS
132Mhz, 83
A
A Class instructions, 374 ACC field of SPARC-V8 Reference MMU PTE, 208 accesses diagnostic ASI, 69 I/O, 73 physically noncacheable, 21 with side-effects, 71, 337 Accumulated Exception (aexc) field of FSR register, 193, 195 active test data register, 415 address alias, 19, 26, 40 illegal, 68 map, 36, 324, 330 physical, 23 translation, virtual-to-physical, 23, 24 Address Mask (AM), 186 field of PSTATE register, 35, 124, 162, 185, 212, 213, 215 Address Space Identifier (ASI), 35, 39, 335, 479 AFAR ECU, 254, 258 PCI DMA UE AFSR, 331 PCI DMA UE/CE, 330, 333 PCI PIO Write, 295 AFSR ECU, 251, 252, 258 PCI DMA CE, 330, 334
PCI DMA UE, 330 PCI PIO Write, 295 alias, 479 address, 19, 68 boundary, 68 boundary, minimum, 68 of prediction bits, illustrated, 343 alignaddr_offset field of GSR register, 138, 154, 155 ALIGNADDRESS instruction, 138, 154 ALIGNADDRESS_LITTLE instruction, 138, 154 aligning branch targets, 340 alignment instructions, 154 Alternate Global Registers, 202 Ancillary State Register (ASR), 52 annex register file, 16 annulled slot, 346 APB, 83 arbiter, see PCI arbitration conflict, 352 Arithmetic and Logic Unit (ALU), 9, 16 ARRAY16 instruction, 165 ARRAY32 instruction, 165 ARRAY8 instruction, 165 ASI field of SFSR register, 223 restricted, 39, 215, 335 ASI_AS_IF_USER_PRIMARY, 75, 214 ASI_AS_IF_USER_PRIMARY_LITTLE, 75 ASI_AS_IF_USER_SECONDARY, 75, 214 ASI_AS_IF_USER_SECONDARY_LITTLE, 75 ASI_ASYNC_FAULT_ADDRESS, 254 see also AFAR, ECU ASI_ASYNC_FAULT_STATUS, 252 see also AFSR, ECU
ASI_BLK_COMMIT_PRIMARY, 68, 69 ASI_BLK_COMMIT_SECONDARY, 68, 69 ASI_DCACHE_DATA, 393 ASI_DCACHE_TAG, 393 ASI_ECACHE Diagnostic Accesses, 394 ASI_ECACHE_TAG_DATA, 395, 396 ASI_ESTATE_ERROR_EN_REG, 250 CEEN field, 251 NCEEN field, 251 SAPEN field, 251 UEEN field, 251 ASI_ICACHE_INSTR, 388, 390, 391, 392 ASI_ICACHE_PRE_DECODE, 389 ASI_ICACHE_PRE_NEXT_FIELD, 391 ASI_ICACHE_TAG, 389 ASI_INT_ACK, 322 ASI_INTR_DISPATCH_STATUS, 122 ASI_INTR_RECEIVE, 123 ASI_LSU_CONTROL_REGISTER, 384 ASI_NUCLEUS, 75, 214, 217 ASI_NUCLEUS_LITTLE, 75, 217 ASI_PHYS_*, 219 ASI_PHYS_BYPASS_EC_WITH_EBIT, 213, 218, 224, 234 ASI_PHYS_BYPASS_EC_WITH_EBIT_LITTLE, 213 , 234 ASI_PHYS_USE_EC, 21, 75, 234 ASI_PHYS_USE_EC_LITTLE, 75, 234 ASI_PRIMARY, 75, 217, 223 ASI_PRIMARY_LITTLE, 75, 217, 223 ASI_PRIMARY_NO_FAULT, 76, 206, 213, 214, 215 ASI_PRIMARY_NO_FAULT_LITTLE, 76, 206, 213, 215 ASI_REG Ancillary State Register (ASR), 53 ASI_SDB_INTR, 122 ASI_SDB_INTR_W, 121 ASI_SDBH_CONTROL_REG, 257 ASI_SDBH_ERROR_REG, 256 ASI_SDBL_CONTROL_REG, 257 ASI_SDBL_ERROR_REG, 256 ASI_SECONDARY, 75 ASI_SECONDARY_LITTLE, 75 ASI_SECONDARY_NO_FAULT, 76, 206, 213, 214, 215 ASI_SECONDARY_NO_FAULT_LITTLE, 76, 206, 213, 215 ASIs that support atomic accesses, 74 Asynchronous Fault Address Register, see AFAR Asynchronous Fault Status Register, see AFSR
atomic accesses, 74 accesses, supported ASIs, 74 accesses, with non-faulting ASIs, 75 instructions in cacheable domain, 74 load-store instructions, 69 avoiding the bus turn-around penalty, 355
B
band interleaved images, 135 band sequential images, 135 big-endian, 89 byte order, 35, 169 bit vector concatenation, xl block commit store, 20 copy, inner loop pseudo-code, 177 load, 372 load instructions, 1, 21, 69, 78, 172 memory access, 406 memory operations, 200 store, 372, 373, 374 store instructions, 1, 21, 78 block-transfer ASIs, 173 board-level interconnect testing and diagnosis, 409 boundary scan, 409 chain, 415 register, 415, 416, 417 branch mispredicted, 16 predicted not taken, 366 predicted taken, 366 prediction, 15, 345 likely not taken state, 345 likely taken state, 345 target alignment, 340 transformation to reduce mispredicted branches illustrated, 349 bus error, 79 during exit from RED_state, 270 turn-around, 355 turn-around penalty, avoiding, 355 turn-around time, 355 bypass ASI, 39, 218, 383 byte granularity, 356 Byte Mask
see UPA64S, Byte Mask byte-twisting, 89, 90, 91
C
C stage, 347, 369, 371 cache direct mapped, 352 flushing, 68 inclusion, 68 level-1, 67 level-2, 67 set-associative, 352 write-back, 67 Cache Access (C) Stage, 16 illustrated, 13 cache coherence protocol, 70 cache flush software, 69 cache line dirty, 483 invalidating, 69 cache miss, 370 impact, 2 cache timing, 371 cacheable accesses, 20, 70, 70, 370, 373 cacheable after non-cacheable accesses, 338 cacheable domain, 74 Cacheable in Physically Indexed Cache (CP) field of TTE, 207, 337 Cacheable in Physically Indexed Cache (PC) field of TTE, 197 Cacheable in Virtually Indexed Cache (CV) field of TTE, 207 cacheable space, 36 see also address map caching TSB, 209 CANRESTORE Register, 187, 363 CANSAVE Register, 187, 363 capacity misses, 353 CAS instruction, 75 CE, see ECC, CE clean window, 187, 479 clean_window trap, 56, 187 CLEANWIN Register, 187, 363 CLEANWIN register, 187 CLEAR_SOFTINT Ancillary State Register
(ASR), 125 CLEAR_SOFTINT register, 54, 124, 125 code space dynamically modified, 74 coherence, 479 unit of, 70 coherence domain, 70 coherency, 482 cache, 70 I-Cache, 20 color virtual, 68 concatenation of bit vectors symbol, xl COND_CODE_REG Ancillary State Register (ASR), 53 condition code generation, 16 -setting, dedicated hardware, 362 configuration and status registers see CSR space, see PCI, configuration space conflict-misses, 353 consistency, 479 between code and data spaces, 74 Context field of TTE, 206 ID (CT) field of SFSR register, 224 context, 479, 480 register, 216 Control Transfer instruction (CTI), 365, 366 conventions, textual, xxxix fonts and symbols, xxxix copybacks cache line, 479 corrected_ECC_error trap, 57 cost of mispredicted branch illustrated, 348 counter field of TICK register, 186 CPI, 479 cross call, 202 cross-block scheduling, 2 CSR, 90 endianness, 90 CSRs summary of new, 330 CTI couple, 342, 348 current memory model, 335
window, 479 Current Exception (cexc) field of FSR register, 190, 193, 195 Current Little Endian (CLE) field of PSTATE register, 223 Current Window Pointer, 479 CWP Register, 182, 187, 263 cycles per instruction (CPI), 2, 2
D
DAC, see PCI, DAC Data 0 (D0) field of PIC register, 402 Data 1 (D1) field of PIC register, 402 data alignment, 351 data cache see D-cache data parity error see error, PCI, DPE Data Translation Lookaside Buffer (dTLB), 19, 263 illustrated, 4 data watchpoint, 383 physical address, 213, 384 virtual address, 213, 384 data_access_error trap, 56 data_access_exception trap, 39, 40, 41, 47, 56, 71, 74, 76, 122, 169, 174, 178, 179, 181, 185, 196, 197, 202, 206, 208, 211, 212, 213, 215, 219, 221, 223, 224, 229, 381, 388 data_access_MMU_miss trap, 196, 210, 212 data_access_protection trap, 208, 212, 213 D-cache, 16, 20, 80, 263, 352, 353, 354, 355, 356, 372, 373, 405 access statistics, 405 array access, 353 bypassing, 353 data access address, illustrated, 393 data access data, illustrated, 393 enable bit, 20 enable field of LSU_Control_Register, 385 flush, 68 hit, 16, 371 hit rate, 351 hit timing, 371 illustrated, 4 line, 351 load hit, 372 logical organization illustrated, 350 miss, 16, 371, 406
miss load, 372 miss, E-Cache hit timing, illustrated, 352, 353 miss, E-Cache hit timing, illustrated, 353 misses, 351, 353, 356 organization, 350 read hit, 405 sub-block, 351 tag access, 353 tag/valid access address, illustrated, 393 tag/valid access data, illustrated, 393 timing, 350 DCTI couple, 361 decode (D) Stage illustrated, 13 decode (D) stage, 15 default byte order, 35 deferred error, 74 trap, 80, 183 delay slot, 366, 369 and instruction fetch, 341 annulled, 368 delayed control transfer instruction (DCTI), 365 delay slot, 80, 366 delayed return mode, 371, 372 demap, 479 Demap Context operation, 232 dependency checking, 368 load use, 346 destination register, 481 diagnostic accesses, I-Cache, 215 ASI accesses, 69 Diagnostic (Diag) field of TTE, 207 diagnostics control and data registers, 381 DIMM see also Memory requirements, 36 Direct Pointer register, 228 direct-mapped cache, 25, 352 dirty cache line, 483 Dirty Lower (DL) field of FPRS register, 192 Dirty Upper (DU) field of FPRS register, 192 disabled MMU, 197 dispatch, 479 Dispatch Control Register MVX, 458 Dispatch Control register, 382, 458
GS, 458 MS, 458 DISPATCH_CONTROL_REG register, 54 Dispatch0, 404 displacement flush, 68, 69 divider, 9 division algorithm, 187 division_by_zero trap, 56 DMA transfers, 20 D-MMU, 212, 214, 216 enable bit, 21, 218 domain, cacheable and noncacheable, 73 DONE instruction, 80, 202, 385 DPD see errors, PCI, Data Parity error Detected DRAM see EDO DRAM Dual Address Cycle see PCI, DAC dynamic branch prediction state diagram, illustrated, 346, 392 Dynamic Set Prediction, 387 dynamically modified code space, 74
E
E Stage, 371, 373 E-cache, 2, 20, 29, 69, 80, 167, 239, 263, 344, 351, 352, 353, 354, 355, 356, 361, 405 access statistics, 405 AFAR, 258 AFSR, 258 Data RAM, illustrated, 5 diagnostic access, 394 Error Enable Register, 240, 242, 250 executing code from, 344 flush, 68 line, 351 parity error, 240 scheduling, 353 SRAM, 370, 373 update, 337 E-cache Tag RAM, illustrated, 5 E-cache), 16 ECC, 419, 453, 454 see also AFAR, ECU or AFSR, ECU CE, 242 multi-bit error, 240 PCI DMA CE AFSR, 330, 334 PCI DMA UE AFSR, 330, 331
PCI DMA UE/CE AFAR, 330, 333 ECU AFAR, 254 see also E-cache edge handling instructions, 161 edge mask encoding, 162 little-endian, 163 EDGE16 instruction, 161 EDGE16L instruction, 161, 162 EDGE32 instruction, 161 EDGE32L instruction, 161, 162 EDGE8 instruction, 161 EDGE8L instruction, 161, 162 EDO DRAM, 59 see also Memory enable bit, D-MMU, I-MMU, 218 D-MMU (DM) field of LSU_Control_Register, 21, 385 Floating-Point (PEF) field of PSTATE register, 137, 382 I-MMU (IM) field of LSU_Control_Register, 385 endianness, 206 enhanced security environment, 186 error CE, 244 detection, 239 DMA ECC Errors, 247 E-cache Tag Parity Error, 243 instruction access error, 243, 244 IOMMU Translation Error, 247 PCI, 245 Data Parity error Detected, 245 Data Parity error Detected (DPD), 245 DPE, 245 PER, 245 system Error, 248 target abort, 246 reporting, 239 SDB Error Control Register, 257 summary, 249 time out, 241, 244 UE, 244 unreported, 250 error_state, 182 error_state processor state, 263 errors instruction access error, 243 E-Stage, 16
E-stage, 16, 369, 371, 372, 373 illustrated, 13 stalls, 371 ESTATE_ERR_EN Register, 270 ESTATE_ERR_EN register, 201 exception handling, 239 execution stage see E-Stage EXPAND instruction, 145 extended (non-SPARC-V9) ASIs, 41 floating-point pipeline, 13 instructions, 1, 203 external cache see E-cache cache unit (ECU) illustrated, 4 power-down (EPD) signal, 180 Externally Initiated Reset (XIR), 186, 263 externally_initiated_reset trap, 56
F
FALIGNDATA instruction, 154, 155, 171 FAND instruction, 156 FANDNOT1 instruction, 156 FANDNOT1S instruction, 157 FANDNOT2 instruction, 157 FANDNOT2S instruction, 157 FANDS instruction, 156 Fast Back-to-Back cycles, see PCI, Fast Back-to-Back fast_data_access_MMU_miss trap, 57, 211, 212, 225 fast_data_access_protection trap, 57, 202, 211, 212, 228 fast_instruction_access_MMU_miss trap, 57, 202, 211, 212, 225 Fault Address field of SFAR, 226 Fault Type (FT) field of SFSR register, 71, 74, 76, 197, 213, 223, 381, 388 Fault Valid (FV) field of SFSR register, 225 fccN, 479 FCMPEQ instruction, 160 FCMPEQ16 instruction, 159 FCMPEQ32 instruction, 159 FCMPGT instruction, 160 FCMPGT16 instruction, 159 FCMPGT32 instruction, 159 FCMPLE instruction, 160 FCMPLE16 instruction, 159 FCMPLE32 instruction, 159
FCMPNE instruction, 160 FCMPNE16 instruction, 159 FCMPNE32 instruction, 159 Fetch (F) Stage, 15 illustrated, 13 FEXPAND instruction, 140 FEXPAND operation illustrated, 146 FFB_Config Register, 277, 278 fill_n_normal trap, 57 fill_n_other trap, 57 floating point and graphics instruction classes, 374 and graphics instructions, latencies, 378 condition code, 479 condition codes, 375 deferred trap queue (FQ), 195 exception, 480 exception handling, 190 IEEE-754 exception, 480 multiplier, 376 pipeline, 13 queue, 13 register file, 16, 17, 21 square root, 190 store, 374 trap type, 480 trap type (FTT) field of FSR register, 194, 480 Floating Point and Graphics Unit (FGU), 15, 16, 17 Floating Point Condition Code (FCC) 0 (FCC0) field of FSR register, 193, 194 1 (FCC1) field of FSR register, 193 2 (FCC2) field of FSR register, 193 3 (FCC3) field of FSR register, 193 field of FSR register in SPARC-V8, 194 Floating Point Registers State (FPRS) Register, 192 Floating Point Unit (FPU) illustrated, 4 flush D-Cache, 68 displacement, 68 FLUSH instruction, 72, 74, 80, 196, 385 FMUL16x16 instruction, 147 FMUL8SUx16 operation illustrated, 151 FMUL8ULx16 operation illustrated, 152 FMUL8x16 instruction, 147 operation illustrated, 149 FMUL8x16AL
instruction, 147 operation illustrated, 150 FMUL8x16AU instruction, 147 operation illustrated, 150 FMULD16x16 instruction, 147 FMULD8SUx16 operation illustrated, 152 FMULD8ULx16 operation illustrated, 153 FNAND instruction, 156 FNANDS instruction, 156 FNOR instruction, 156 FNORS instruction, 156 FNOT1 instruction, 156 FNOT1S instruction, 156 FNOT2 instruction, 156 FNOT2S instruction, 156 FONE instruction, 156 FONES instruction, 156 fonts textual conventions, xxxix FOR instruction, 156 Force Parity Error Mask (FM) field of LSU_Control_Register, 385 formation of TSB pointers illustrated, 236 FORNOT1 instruction, 156 FORNOT1S instruction, 156 FORNOT2 instruction, 156 FORNOT2S instruction, 156 FORS instruction, 156 fp_disabled trap, 54, 56, 137, 138, 140, 141, 148, 155, 158, 160, 164, 169, 171, 174, 179, 382 fp_disabled_ieee_754 trap, 56 fp_exception_ieee_754 trap, 189, 194, 195 fp_exception_other trap, 56, 181, 189, 190, 191, 194, 195 FP_STATUS_REG Ancillary State Register (ASR), 53 FPACK16 instruction, 140, 141 operation illustrated, 142 FPACK32 instruction, 140, 143 operation illustrated, 144 FPACKFIX instruction, 136, 140, 144 operation illustrated, 145 FPADD16 instruction, 139 FPADD16S instruction, 139, 140 FPADD32 instruction, 139
FPADD32S instruction, 139, 140 FPMERGE instruction, 140 operation illustrated, 147 FPRS Register, 363 FPSUB16 instruction, 139 FPSUB16S instruction, 139, 140 FPSUB32 instruction, 139 FPSUB32S instruction, 139, 140 FPU Enabled (FEF) field of FPRS register, 137, 382 FQ, see floating-point deferred trap queue (FQ) frame buffer, 355 FSRC1 instruction, 156 FSRC1S instruction, 156 FSRC2 instruction, 156 FSRC2S instruction, 156 FXNOR instruction, 156 FXNORS instruction, 156 FXOR instruction, 156 FXORS instruction, 156 FZERO instruction, 156 FZEROS instruction, 156
G
G Stage, 372 global visibility, 73 Global (G) field of TTE, 206, 208 global registers, 10 alternate, 10 interrupt, 10 MMU, 10 normal, 10 granularity byte, 356 sub_block, 356 GRAPHIC_STATUS_REG register, 54 graphics data format, 135 data format, 8-bit, 135 data format, fixed (16-bit), 136 instructions, 372 status Register (GSR), 137 unit (GRU) illustrated, 4 Graphics Status Register (GSR), 382 group stage see G-stage
group break, 365 grouping rules, general, 360 grouping stage see G-stage G-stage, 15, 369, 372, 373, 376 illustrated, 13 stall, 377 stall counts, 404
H
hardware errors, fatal, 80 interrupts, 202 table walking, 211 hardware_error floating point trap type, 195, 480 hardware_error floating-point trap type, 195 high water mark, for stores, 355
I
I/O access, 73, 78 control registers, 70 devices, 355 memory, 336 I-Cache illustrated, 4 I-cache, 15, 19, 263, 344, 354, 385, 387 access statistics, 405 coherency, 20 diagnostic accesses, 215 disabled in RED_state, 269 Enable field of LSU_Control_Register, 385 flush, 68 hit, 19 Instruction Access Address, 388 Instruction Access Address, illustrated, 388 Instruction Access Data, 389 illustrated, 389 miss, 406 miss latency, 344 miss processing, 343, 392 organization, 340 organization illustrated, 340, 388 Predecode Field Access Address, 390 Predecode Field Access Address illustrated, 390 Predecode Field Access Data, 390
Predecode Field LDDA Access Data illustrated, 390 Predecode Field STXA Access Data illustrated, 390 Tag/Valid Access Address illustrated, 389 Tag/Valid Access Data illustrated, 389 Tag/Valid Field Access Address, 389 Tag/Valid Field Access Data, 389 timing, 343 utilization, 347 IEEE Std 1149.1-1990, 409 IEEE Std 754-1985, 193 IEEE_754_exception floating-point trap type, 195, 480 IEU0 pipeline, 362 IEU1 pipeline, 362 IGN, 110, 314 II-cache miss, 361 illegal address aliasing, 68 illegal_instruction trap, 53, 54, 56, 124, 125, 169, 173, 174, 181, 185, 195, 197, 198, 202, 203 ILLTRAP instructions, 181 image compression algorithms, 1 processing, 1 I-MMU, 216 disabled, 79 disabled in RED_state, 269 Enable bit, 218 IMPDEP1 instruction, 138 impl field of VER register, 188 implementation dependency, xxxix dependent, 480 inclusion, 68 initialization requirements, 262 INO, 110, 314 INR, 108 instruction alignment for grouping logic, 341 block load, 1 block store, 1 breakpoint, 383 buffer, 15, 343, 344, 350, 360, 361, 363, 367 buffer illustrated, 4 cache see I-cache dispatch, 361 multicycle, 368
prefetch, 74 prefetch buffers, 74 prefetch to side-effect locations, 79 prefetch, when exiting RED_state, 79 termination, 17 instruction grouping anti-dependency constraints, 360 input dependency constraints, 360 output dependency constraints, 360 read-after-write dependency constraints, 360 write-after-read dependency constraints, 360 write-after-write dependency constraints, 360 instruction set architecture, 480 Instruction Translation Lookaside Buffer (iTLB), 19, 263 illustrated, 4 misses, 345 instruction_access_error trap, 243, 244 instruction_access_error trap, 56, 79, 201, 270 instruction_access_exception trap, 56, 185, 208, 211, 212, 219, 224 instruction_access_MMU_miss trap, 210, 212, 224, 225 integer divider, 9 division, 187 multiplication, 187 multiplier, 9 pipeline, 13 register file, 17, 187, 362 Integer Core Register File (ICRF), 15 Integer Execution Unit (IEU), 9, 362 illustrated, 4 pipelines, 362 interleaved D-Cache hits and misses to same subblock, 354 interlocks, 15 internal ASI, 40, 79, 370, 373 store to, 80 interrupt, 313 Clear Interrupt Register, 318, 319 concentrator see RIC dispatch, 118, 121 errors, 115 fsm states, 117 Full Interrupt Mapping Registers, 318 global registers (IGR), 120, 200, 202 Group Number see IGN IGN, see IGN
Incoming Interrupt Vector Data Registers, 122 INO, see INO INR see INR Interrupt State Diagnostic Registers, 320, 321 Interrupt Vector Dispatch Register, 122 Interrupt Vector Receive Register, 123 level, 116, 315, 321 Number Offset, see INO packet, 202 Partial INR, 111 Partial Interrupt Mapping Registers, 316, 317 PCI INT_ACK Register, 322, 323 PIE, 108 priorities, 112, 117 PSTATE.IE, 114 pulse, 315 SB_DRAIN, see SB_DRAIN SB_EMPTY see SB_EMPTY sources, 114 summary, 119 theory of operation, 112 Interrupt Disable (INT_DIS) field of TICK register, 199 field of TICK_CMPR register, 124 Interrupt Enable (IE) field of PSTATE register, 199 Interrupt Globals (IG) field of PSTATE register, 120, 200, 201 INTERRUPT_GLOBAL_REG register, 55 interrupt_level_n trap, 56 interrupt_vector trap, 56, 120, 202 invalid_fp_register floating-point trap type, 195, 480 invalidating a cache line, 69 Invert Endianness (IE) bit, 40 (IE) field of TTE, 206 IOMMU, 95 block diagram, 96 bypass mode, 95, 100 CAM, 96 ERR, 97 ERRSTS, 97 S, 97 SIZE, 97 W, 97 Control Register, 98, 308 LRU_LCKEN, 308 LRU_LCKPTR, 308 MMU_DE, 309
MMU_EN, 309 TBW_SIZE, 309 TSB_SIZE, 308 DAC, 99 Data RAM Diagnostic Access, 312 Demap, 105 Flush Address Register, 311 initialization, 106 locking, 310 lookup procedure, 99 MMU_EN, 98 modes, 98 PA, 98, 312 page sizes, 95 Pass-through Mode, 101 PIO/DMA access conflicts, 104 Pseudo-LRU replacement algorithm, 105 RAM, 98 C, 98, 312 U, 98, 312 V, 98, 312 replacement policy, 105 SAC, 98 Tag Compare Diagnostic Register, 313 TAG Diagnostics Access, 311 TBW_SIZE Translation Errors, 104, 247 Translation Storage Buffer, see TSB and IOMMU, TSB TSB, 95 Base Address Register, 102, 310 TSB Offset, 103 TSB_SIZE, 101 TTE, 97 CACHEABLE, 102 DATA_PA, 102 DATA_SIZE, 102 DATA_SOFT, 102 DATA_SOFT_2, 102 DATA_V, 102 DATA_W, 102 LOCALBUS, 102 STREAM, 102 VA, 97 ISA, 480 Issue Barrier (MEMBAR #Sync), 74 I-Tag Access Register, 212 iTLB miss handler, 206
J
JMPL to noncacheable target address, 79
K
kernel code, 124
L
LDD instruction, 198 LDDA instruction, 171, 173 LDDF_mem_address_not_aligned trap, 56, 198 LDQF instruction, 198 LDQFA instruction, 198 LDSTUB instruction, 75 LDUW instruction replaces SPARC-V8 LD, 351 leaf subroutine, 349 level interrupt see Interrupt, level level-1 cache, 19 flushing, 67 level-1 instruction cache, 387 level-2 cache, 20, 67 see also E-cache little endian, 89, 162 ASIs, 92, 171 byte order, 35, 169 load buffer, 2, 16, 17, 72, 80, 353, 354, 355, 370, 372, 373, 405 buffer illustrated, 4 hit bypassing load miss (not supported on UltraSPARC-I), 354 latencies, 354 outstanding, 373 store Unit (LSU), 213 store Unit (LSU) illustrated, 4 to the same D-Cache sub-block, 354 use dependency, 346 use stall, 376 use, stall counts, 404 loads, always execute in order, 353 Lock (L) field of TTE, 207 loop unrolling, 349 LSU_Control_Register, 19, 20, 21, 218, 269, 383, 384, 384
illustrated, 385
M
M Class instructions, 374 mandatory SPARC-V9 ASRs, 53 manuf field of VER register, 188 mask field of VER register, 188 MAXTL, 182, 263 maxtl field of VER register, 188 maxwin field of VER register, 188 may, 480 mem_address_not_aligned trap, 289 mem_address_not_aligned trap, 56, 169, 171, 173, 174, 178, 179, 185, 211, 213, 221, 223, 351, 381 Mem_Control0, 277 11-bit Column Address, 280 accessing, 277 ECCEnable, 279 RefEnable, 279 RefInterval, 281 SIMMPresent, 280 Mem_Control1, 277, 282 accessing, 277 ARDC- Advance Read Data Clock, 283 CASRW- CAS assertion for read/write cycles, 285 CP - CAS Precharge, 286 CSR - CAS before RAS delay timing, 284 RAS assertion, 287 RCD - RAS to CAS Delay, 285 RP - RAS Precharge, 286 RSC-RAS after CAS delay timing, 287 suggested values, 288 MEMBAR #LoadLoad, 72, 336, 337 MEMBAR #LoadStore, 73, 73, 175, 373 MEMBAR #Lookaside, 70, 73, 336, 337, 338 MEMBAR #Lookaside vs MEMBAR #StoreLoad, 70 MEMBAR #MemIssue, 72, 73, 337, 338, 372, 373 MEMBAR #StoreLoad, 70, 72, 72, 81, 175, 336, 372, 373 MEMBAR #StoreStore, 73, 175, 197, 373 and STBAR, 73 MEMBAR #Sync, 40, 69, 72, 74, 80, 174, 175, 221, 223, 233, 373, 374 MEMBAR examples and memory ordering, 71 MEMBAR instruction, 71, 72, 79, 121, 338
MEMDATA see Memory see UPA64S, MEMDATA Memory detecting 11-bit column addresses, 399 memory, 59 access instructions, 168 address map, 63, 66 addressing, 62, 65 block diagram, 60, 61 detecting 11-bit column addresses, 399 detecting DIMM pair Size, 399 detecting DIMM size, 398 DIMM requirements, 36 ECC, 419, 453, 454 mapped I/O control registers, 70 model, 175, 335 ordering, 70, 71 probing, 397 RASX_L mapping, 63, 66 synchronization, 72 Memory Interface Unit (MIU) illustrated, 4 Memory Management Unit (MMU), 16, 23, 205, 480 illustrated, 4 software view, 26 Memory Model (MM) field of PSTATE register, 335 minimum alias boundary, 68 mispredicted branch, 16 control transfer, 367 miss handler iTLB, 206 Translation Lookaside Buffer (TLB), 69 missing TLB entry, 209 MMU, 480 behavior during RED_state, 218 behavior during reset, 218 bypass mode, 35, 234 demap, 231 demap context operation, 231, 233 demap operation format illustrated, 232 demap page operation, 231, 233 disabled, 197 dTLB Tag Access Register illustrated, 228 D-TSB Register illustrated, 226 generated traps, 211 global registers, 200, 202, 211 Globals (MG) field of PSTATE register, 200, 201 iTLB Tag Access Register illustrated, 228
I-TSB Register illustrated, 226 page sizes, 23 requirements, compliance with SPARC-V9, 220 Synchronous Fault Address Register (SFAR) illustrated, 226 MMU_GLOBAL_REG register, 55 module, 480 Mondo vector see interrupt MOVX_ENABLE, 458 MUL8SUx16 instruction, 151 MUL8ULx16 instruction, 151 MUL8x16 instruction, 148 MUL8x16AL instruction, 150 MUL8x16AU instruction, 149 MULD8SUx16 instruction, 152 MULD8ULx16 instruction, 153 multicycle instructions, 368 Multiflow TRACE and Cydrome Cydra-5, 357 multiple bit ECC error, 240 see also ECC, UE multiplication algorithm, 187 multiplier, 9 Multi-Scalar Dispatch Control, 458 M-way set-associative TSB, 209
N
N1 stage, 16, 371 N1 stage illustrated, 13 N2 stage, 17, 368, 372 N2 stage illustrated, 13 N2 stage stall, 378 N3 stage, 17, 348, 372, 373 N3 stage illustrated, 13 NCEEN bit of ESTATE_ERR_EN register, 79 nested traps in SPARC-V9, 182 not supported in SPARC-V8, 182 next field aliasing between branches illustrated, 342 next program counter, 480 NFO bit in MMU, 76 NFO page attribute bit, 357 NO_FAULT ASI, 76 No-Fault Only (NFO) field of TTE, 206, 215 nonallocating cache, 350 nonblocking loads, 353 noncacheable, 20
accesses, 20, 70, 72, 370, 373 instruction prefetch, 79 space, 36 stores, 355 noncacheable space see also address map Noncorrectable Error Enable (NCEEN) field of ESTATE_ERR_EN register, 201, 270 nonfaulting ASIs, and atomic accesses, 75 nonfaulting load, 76, 197, 212, 357 and TLB miss, 76 nonprivileged, 480 mode, 480 Trap (NPT) field of TICK register, 186 nonrestricted ASI, 39 Non-Standard (NS) field of FSR register, 189, 190, 194 nontranslating ASI, 40, 383 normal ASI, 39 normal memory, 481 notational conventions see conventions, textual Notes bad TSB size/address combinations, 103 clearing the interrupt busy bit, 123 CSR aliasing with illegal addresses, 52 CSR endianness, 293, 300 CSR/DMA arbitration for IOMMU, 312 DIMM memory composite specification, 282 disabling refresh, 281 E-cache diagnostic access, 394 ECC check bit equation, 420 emulation, 288 endianness, 325 illegal address can alias to CSRs, 394 initializing memory control registers, 288 Interrupt Clear Registers, 320 Interrupt XMIT state if Valid not enabled, 117 IOMMU ERR and ERRSTS Control Register bits, 309 IOMMU multiple matches illegal, 312 IOMMU not true LRU, 105 IOMMU page sizes, 309 IOMMU Used bit, 312 MEMBAR #Sync after stores to CSRs, 250 no individual subsystem resets, 180 no SDB asic, 255 no timeouts possible for IOMMU tablewalk, 104 no UE forced on writeback parity error, 244 no Wakeup Reset support, 265
no zeroing of incoming PCI AD bits, 329 no zeroing of outgoing PCI AD bits, 327 one-hot PCI ARB_PRIO needed, 295 PCI Bus Number, 326 PCI Configuration cycles with random byte enables, 85 PCI DAC, 330 PCI DMA CE Interrupt, 334 PCI DMA to UPA64S, 89 PCI DMA UE AFSR/AFAR loaded on IOMMU errors, 247 PCI DMA UE AFSR/AFAR loaded on IOMMU errors, 332 PCI Memory Space, 327 PCI parity errors and PER, 245 PCI PIO data buffer diagnostic access, 299 PCI PIO Write AFAR, 297 potential race between IOMMU flush and DMA, 311 PSTATE.IE used to inhibit V8 style interrupts, 114 reading PCI configuration space registers, 302 re-enabling interrupts, 242 sequential action for E-cache diagnostic access, 395 short reset mode, 265 some interrupts skip RECEIVED state in fsm, 316 specifying CAS for memory read/write, 282 TPC, TNPC undefined after deferred trap, 240 UE AFSR/AFAR loaded on IOMMU translation errors, 105 UE can over CE in ECU AFSR, 256 unimplemented reserved addresses (CSRs), 52 nPC, 480 nPC Register, 185 Nucleus code, 124 nucleus context, 178 Nucleus Context Register, 223 NWINDOWS, 187, 188, 480
O
Observability Bus group select, 458 odd fetch to an I-Cache line illustrated, 342 optional, 480 ordering between cacheable accesses after noncacheable accesses, 73 DMA writes and Interrupts, 109 see also PCI, DMA Write Synchronization Register see also SB_DRAIN or SB_EMPTY OTHERWIN Register, 187, 363 out of range violation, 227, 228, 232 virtual address, 184 virtual address, as target of JMPL or RETURN, 185 virtual addresses, 24 virtual addresses, during STXA, 221 outstanding loads, 373 store, 373 overflow exception, 190 Overwrite (OW) field of SFSR register, 225
P
P_NCWR_REQ, 337 P_REPLY see UPA64S,P_REPLY PA Data Watchpoint Register, 213 illustrated, 384 PA Watchpoint Address Register, 221 PA_watchpoint trap, 56, 169, 171, 174, 179, 383 pack instructions, 136, 138, 141 page number, physical, 23 number, virtual, 23 offset, 23 Size (Size) field of TTE, 206 size, encoding in Translation Table Entry (TTE), 206 parity error, 80 Parity Error Enable see error, PCI, PER or E-cache, Error Enable Register Partial Interrupt Number Register, see interrupt, partial INR partial store ASI, 169 instruction, 168, 169, 200 to noncacheable address, 337 Partial Store Order (PSO) memory model, 335, 337
partitioned multiply instructions, 147 PBM, see PCI, PBM PC, 481 PC Ancillary State Register (ASR), 53 PCI address spaces, 38, 323, 330 Address/Data Stepping, 84 arbiter, 83, 87 ARB_PARK, 87 ARB_PRIO, 87 Bus Parking, 87 byte-twisting, 90, 91 see also little-endian Cache-line Wrap Addressing Mode, 84 commands generated, 87 commands ignored, 88 Configuration cycles, 85, 326 address, 325 Type 0, 325 Type 1, 325, 326 configuration cycles Type 0, 85 Type 1, 85 Configuration Space, 300, 325, 327 Base Class Code Register, 304 Bus Number, 306 Command Register, 303 Device ID, 302 header registers, 83, 301 Header Type Register, 306 Latency Timer Register, 305 Programming I/F Code Register, 304 Revision ID Register, 304 Status Register, 303, 332 Sub-class Code Register, 304 Subordinate Bus Number, 306 Unimplemented Registers, 306 Vendor ID, 302 Control/Status Register, 294 DAC, 99, 329 Data Parity error Detected see errors, PCI, Data Parity error Detected Diagnostic Register, 297 disconnects, 85 DMA CE AFSR, 330, 334 DMA Data Buffer Diagnostic Access, 299 DMA Data Buffer Diagnostics Access (72:64), 300 DMA UE AFSR, 330, 331
DMA UE/CE AFAR, 330, 333 DMA Write Synchronization Register, 298 Dual Address Cycle see PCI,DAC Fast Back-to-Back cycles, 83, 86 I/O Space, 327, 328 IDSEL#, 326 interface, 83 interrupts see interrupt IOMMU bypass mode, 329 pass-through, 329 peer-to-peer mode, 329 Register, 308 translation mode, 329 see also IOMMU Linear Incrementing addressing mode, 85 little endian, 90 LOCK, 84 master-aborts, 85 Memory Space, 328 memory space, 327 PBM, 83 PBM, control and status registers, 292 peer to peer mode, 83 PIO Data Buffer Diagnostic Access, 299 PIO Write AFAR, 295, 297 PIO Write AFSR, 295, 296 prefetch effects, 89 retries, 84 SAC, 98, 328 Single Address Cycle see PCI,SAC special cycles, 85 subtractive decode, 84 system error, 248 target abort, 85, 246 Target Address Space Register, 298 time out, 245 transactions, 87 Type 0, see PCI, configuration cycles Type 1, see PCI, configuration cycles PContext field, 222 PCR Cycle_cnt function, 403 PCR DC_hit function, 405 PCR DC_ref function, 405 PCR Dispatch0_dyn_use function, 405 PCR Dispatch0_ICmiss function, 404 PCR Dispatch0_mispred function, 404
PCR Dispatch0_static_use function, 404 PCR EC_hit function, 406 PCR EC_ref function, 405 PCR EC_snoop_inv function, 406 PCR EC_snoop_wb function, 406 PCR EC_wb function, 406 PCR EC_write_hit_clean function, 406 PCR IC_hit function, 405 PCR IC_ref function, 405 PCR Instr_cnt function, 404 PCR/PIC operational flow illustrated, 403 PDIST instruction, 164 peer to peer mode see PCI, peer to peer mode PERF_CONTROL_REG ASR, 54 PERF_COUNTER register, 54 performance Control Register (PCR), 401 Control Register (PCR) illustrated, 402 counters, for monitoring I-Cache accesses and misses, 344 instrumentation, 401 Instrumentation Counter (PIC), 401 Instrumentation Counter (PIC) illustrated, 402 physical address (PA), 23, 479, 481, 483 data watchpoint, 384 Data Watchpoint Read Enable (PR) field of LSU_Control_Register, 387 Data Watchpoint Write Enable (PW) field of LSU_Control_Register, 387 field of TTE, 207 space, accessing, 35 space, size, 1 Physical Address Data Watchpoint Read Enable (PR) field of LSU_Control_Register, 387 physical memory, 483 physical page attribute bits, MMU bypass mode, 234 number, 23 physically indexed, physically tagged (PIPT) cache, 19, 20 physically noncacheable accesses, 21 PIE, see interrupt, PIE pipeline, 2, 3 9-stage, 13 decoupling, 80 extended floating-point, 13 floating-point, 13 flushing, 20
integer, 13 stages (detailed) illustrated, 14 stages illustrated, 13 stall, 15, 80 pixel compare instructions, 159 data, operations on, 1 ordering, 136 PMERGE instruction, 146 population count (POPC) instruction, 186 power down mode, 203 power on reset (POR), 35, 186, 262, 263, 270, 424 power_on_reset trap, 56 precise traps, 80, 183 prefetch and Dispatch Unit (PDU), 15, 16 and Dispatch Unit (PDU), illustrated, 4 unit, 2 PREFETCHA instruction, 197 prefetchable, 481 Primary Context Register, 216, 222 privilege violation, 225 privileged, 211, 481 (P) field of TTE, 208 (PR) field of SFSR register, 225 (PRIV) field of PCR register, 54, 401, 402 (PRIV) field of PSTATE register, 74, 208, 212, 213, 335, 480, 483 mode, 481 Privileged (PRIV) field of PSTATE register, 481 privileged_action trap, 53, 54, 56, 74, 121, 122, 123, 186, 211, 213, 215, 335, 401 privileged_opcode trap, 54, 56, 124, 125, 180, 199, 401 probing the address space, 39 processor front end components, 339 interrupt level (PIL), 124 interrupt level (PIL) field of PSTATE register, 124, 199 memory model, 175 program counter, 481 order, 72 PROM, 90 instruction fetches, 92 protection violation, 213 PSO memory model, 198
mode, 70, 72 PSTATE, 175 global register selection encodings, 202 register, 200, 202, 363
Q
quad-precision floating-point instructions, 191 queue floating-point, 13 Not Empty (qne) field of FSR register, 195
R
rd, 481 read after write (RAW) hazard, 356 interaction with store buffer, 372 real memory, 336 Red Mode Trap Vector, 34, 182 RED_state, 20, 21, 79, 182, 202, 218, 219, 241, 269, 270, 271, 481 default memory model, 335 exiting, 79, 201, 270 MMU behavior, 218 RED_state_exception trap, 56 Reference MMU, 26 specification, 23 register (R) Stage, 16 file annex, 16 floating point, 16, 17, 21 integer, 17 SFAR, 213 SFSR, 213 stage illustrated, 13 window, 9 Relaxed Memory Order (RMO), 357 memory model, 335, 337 requirements, initialization, 262 reserved, 481 fields in opcodes, 181 instructions, 181 reset, 269 B_POR, 264, 268 B_XIR, 264, 268
block diagram, 262 bus conditions, 266 effects, 266 memory control initialization, 397 POR, 180, 268 POWER_OK, 264 priorities, 269 Push-button Power On Reset, 264 Push-button XIR, 264 Reset, Error, and Debug (RED) field of PSTATE register, 79, 201, 269, 270, 481 Reset_Control Register, 264, 267 SHUTDOWN, 180 SIR, 261 SOFT_POR, 265, 268 SOFT_XIR, 265, 268 Software Power On Reset, 265 Software-Initiated Reset, 261 trap, 481 WDR (Watchdog Reset), 261 Reset, Error, and Debug (RED) field of PSTATE register see reset, Reset, Error, and Debug (RED) field of PSTATE register Reset_Control Register see reset, Reset_Control Register restricted, 481 ASI see ASI, restricted RETRY instruction, 80, 202, 385 Return Address Stack (RAS), 349 after Power-On Reset, 270 in RED_state, 270 RIC chip, 33, 116 RISC architecture, 1 RMO memory model, 198 mode, 70, 72 RMTV, 34, 182 Rounding Direction (RD) field of FSR register, 194 rs1, 481 rs2, 481 RSTVaddr, 182, 271
S
S_REPLY see UPA64S, S_REPLY SAVE instruction, 187
SB_DRAIN, 110 see also ordering SB_EMPTY, 109, 110 Scalable Processor Architecture see SPARC scalarity, 3 scale_factor field of GSR register, 138, 141, 142, 143, 144 scheduling, 199 SContext field, 223 SDB, 239 SDB Error Control Register, 257 SDB Error Register, 239 Secondary Context Register, 222 secure environment, 186 Select Code 0 (S0) field of PCR register, 402 Select Code 1 (S1) field of PCR register, 402 self-modifying code, 74, 196 and FLUSH, 74 sequence_error floating-point trap type, 195, 480 serial scan interface, 409 SET_SOFTINT (ASR) register, 54, 124, 125 SET_SOFTINT Register, 124 set-associative cache, 352 SFAR register, 213 SFSR register, 213 shall expressing requirement, 481 shared cache block, 482 TSB, 210 shift instructions—dedicated hardware, 362 short floating point load instruction, 170, 200 store instruction, 170, 200 should expressing requirement, 482 SHUTDOWN instruction, 180, 203 side effect, 70, 482 accesses, 78 attribute, 197 attribute, and noncacheability, 71 bit, 81 field of SFSR register, 224 field of TTE, 197, 207 sign extended virtual address fields, 25 signal monitor (SIGM) instruction, 183, 263 in non-privileged mode, 183 signed loads, 351 silent loads—equivalent to non-faulting loads, 357 single bit ECC error see ECC,CE snoop, 73, 269, 352, 354, 405, 482
hits, 479 store buffer ———, 336 SOFTINT (ASR) register, 124, 199 SOFTINT_REG Ancillary State Register (ASR), 54, 125 software cache flush, 69 defined (Soft) field of TTE, 207 defined (Soft2) field of TTE, 207 Initiated Reset (SIR), 183, 263 Interrupt (SOFTINT) field of SOFTINT register, 124 Interrupt (SOFTINT) register, 124 pipelining, 2 Translation Table, 25, 196, 208 software_initiated_reset trap, 56 source register, 481 dependency, 376 SPARC, xxxviii Architecture Manual, Version 9, xxxviii brief history, xxxviii International, address of, xxxix V8 compatibility, 73 V8 Reference MMU, 23, 26 V9 compliance, 181, 480 V9, architecture, xxxviii V9, UltraSPARC extensions, xxxix speculative load, 71, 197, 212, 482 support for, 2 to page marked with E-bit, 71 spill_n_normal trap, 57 spill_n_other trap, 57 split field of TSB register, 210, 227 spurious loads eliminating, 356 SRAM, 11, 29 STA, 332 stable storage, 68, 69 STBAR (SPARC-V8), 72 equivalent to MEMBAR #StoreStore, 73 STD instruction, 198 STDA instruction, 171, 173 STDF_mem_address_not_aligned trap, 56, 198 steady state loops, 346 store block commit, 20 buffer, 16 delayed by load, 81 dependency, 373
high-water mark, 355 outstanding, 373 store buffer, 2, 17, 72, 81, 354, 355, 356, 357, 370, 372, 373 compression, 71, 81, 373, 406 compression—disabled for noncacheable accesses, 79 full condition, 356 illustrated, 4 merging, 78 snooping, 336, 337 virtually tagged, 73 STQF instruction, 198 STQFA instruction, 198 strong ordering, 71 sequential order, 336 sub-block granularity, 356 superscalar processor, 1 supervisor software, 482 supported traps, 56 SWAP instruction, 75 Synchronous Fault Address Register (SFAR), 226 Synchronous Fault Status Register (SFSR), 223 illustrated, 223 SYSADDR bus, 422, 429 see also UPA64S system PROM see PROM Trace (ST) field of PCR register, 402
T
Tag Access Register, 210, 227, 229 tag_overflow trap, 56 TAP, 409 controller, 410 controller, state diagram illustrated, 411 controller, state machine, 409 TBW_SIZE, see IOMMU, TBW_SIZE Tcc instruction, reserved fields, 181 TCK IEEE 1149.1 signal, 410 TDI IEEE 1149.1 signal, 410 TDO IEEE 1149.1 signal, 410 terminated instruction, 17 Test Access Port see TAP textual conventions see conventions, textual
thread scheduling, 199 three-dimensional array addressing instructions, 165 Tick Compare… see TICK_CMPR… Tick Interrupt… see TICK_INT… TICK register, 363 illustrated, 186 TICK_CMPR field of TICK register, 124, 199 TICK_CMPR_REG register, 54 TICK_INT, 125, 199 field of SOFTINT register, 124 TICK_REG Ancillary State Register (ASR), 53 time out, see error, time out TL Register, 363 TLB, 167, 196, 482 bypass operation, 234 data, 19 Data Access register, 230, 231 Data In register, 210, 230, 231 demap operation, 234 hit, 16, 25, 482 instruction, 19 miss, 16, 25, 208, 482 and non-faulting load, 76 handler, 69, 178, 206, 209, 210, 220 operations, 234 read operation, 235 reset, 219 Tag Read register, 231 translation operation, 234 write operation, 235 see also IOMMU, TLB TMS IEEE 1149.1 signal, 410 Total Store Order (TSO) memory model, 335, 336 translating ASI, 39, 383 Translation Lookaside Buffer see TLB Translation Storage Buffer see TSB Translation Table Entry see TTE trap, 482 global registers, 200 MMU generated, 211 registers, 10 resolution, 17 stack, 182, 201 state registers, 182 Trap Base Address (TBA) register, 482 Trap Enable Mask (TEM) field of FSR register, 189, 190, 193, 194, 195 trap_instruction trap, 57
TRST_L IEEE 1149.1 signal, 410 TSB, 25, 178, 196, 206, 208, 226, 345 caching, 209 locked items, 211 miss handler, 210 offset, see IOMMU, TSB Offset organization, 209 pointer logic, 235 Pointer register, 229 Register, 209 Tag Target register, 210, 222 see also IOMMU, TSB TSB_Base, 227 TSB_Base field of TSB Register, 227 TSB_Size field of TSB register, 210, 227 TSO memory model, 198 mode, 70, 72 ordering, 70 TSTATE, 202 TTE, 205, 212 illustrated, 205 see also IOMMU, TTE
U
UART, 70 UE, see ECC, UE UltraSPARC extensions to SPARC-V9, xxxix UltraSPARC-I architecture, overview, 1 Data Buffer (UDB), illustrated, 5 extended instructions, 203 internal ASIs, 79 internal registers, 215 subsystem, illustrated, 5 trap levels illustrated, 183 UltraSPARC-I block diagram, 4 UltraSPARC-IIi, 20 unassigned, 482 undefined, 482 underflow exception, 190 unfinished_FPop floating-point trap type, 189, 190, 195, 480 unimplemented, 482 instructions, 181 unimplemented_FPop floating-point trap type, 191,
195, 480 unit of coherence, 70 Universal Asynchronous Receiver Transmitter (UART), 70 unpredictable, 482 unrestricted, 483 UPA_CONFIG register, 289 ELIM, 289 MID, 289 PCAP, 289 UPA64S byte addresses within quadword, 421 Byte Mask byte mask, 430 dead cycle, 428 interface, description, 33 MEMDATA, 426 dead cycle, 425 P_NCBRD_REQ, 422, 429 P_NCBWR_REQ, 423, 429 P_NCRD_REQ, 422, 424, 428, 429, 430 P_NCWR_REQ, 423, 428, 429, 430 P_REPLY, 423, 426 definitions, 424 encoding, 424 P_IDLE, 424 P_RASB, 422, 424 P_WAB, 424 P_WAS, 424 timing, 426 packet format, 429 S_REPLY, 424, 425, 426 assertion, 428 definitions, 425 encodings, 426 rules, 425 S_IDLE, 424, 425, 426 S_RBU, 422, 425 S_SRS, 425, 426 S_WAB, 425 strongly ordered by request, 425 timing, 426 S_SRS, 426 SYSADDR bus, 422 transaction types, 429 user thread termination, 80 User Trace (UT) field of PCR register, 401, 402, 403 UserTrace (UT) field of PCR register, 402
V
VA Data Watchpoint register, 213, 384 illustrated, 384 VA out of range, 225 VA Watchpoint Address Register, 221 VA_tag field of TTE, 206 VA_watchpoint trap, 57, 169, 171, 174, 179, 383 Valid (V) field of TTE, 206 Version (ver) field of FSR register, 194 virtual address, 483 virtual address fields, sign extended, 25 out of range, 24 see also VA… space illustrated, 25, 184 space, size, 1 Virtual Address Data Watchpoint Read Enable (VR) field of LSU_Control_Register, 386 Virtual Address Data Watchpoint Write Enable (VW) field of LSU_Control_Register, 386 virtual color, 68 virtual noncacheable accesses, 20 virtual page number, 23 virtual to physical address mapping, 35 translation, 23, 335 translation illustrated, 24 translation, IOMMU, 99 virtual_address_data_watchpoint_mask, 386 virtually cacheable, 68 virtually indexed, physically tagged (VIPT), 350 virtually indexed, physically tagged (VIPT) cache, 19 virtually noncacheable, 68 virtually tagged store buffers, 73
W
W Stage, 363, 364, 365, 372 W1 Stage virtual stage, 367, 368 Watchdog Reset (WDR), 182, 263 watchdog_reset trap, 56 watchpoint trap, 213, 382 window_fill trap, 185 Writable (W) field of TTE, 208 Write (W) field of SFSR register, 225 Write (W) Stage, 17 illustrated, 13 Write-After-Read (WAR) hazard, 357 writeback, 483 write-through cache, 350 WSTATE Register, 363
X
X1 Stage, 16 illustrated, 13 X2 Stage, 17 illustrated, 13 X3 Stage, 17 illustrated, 13
Y
Y_REG Ancillary State Register (ASR), 53